I am working on a content type spoof detector for a web application. My issue can be answered by any developer with experience on this subject.
My input is an object that exposes its filename, content_type, and io. The object's content_type is determined by a library called Marcel: it is reduced to the most specific MIME type guessed from the io, the filename, and the file extension.
The issue is that, when Marcel is used this way, the content_type can be spoofed (that’s why I am building this detector). For example, a spoofed .jpg file whose actual content is text/plain, but which has an ‘image/jpg’ content_type and a .jpg extension, will be reported as ‘image/jpg’.
To solve this, I am analyzing the object's io with the Linux file command to determine the ‘real’ content_type. But there is a problem with this approach: the file command sometimes returns a content_type that is less precise than the one provided by the object, or that is an alias of it.
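
As a rough sketch of this detection step, the content can be piped to the file command through Ruby's Open3 from the standard library. The helper name sniff_mime_type and the injectable command argument are hypothetical, added only for illustration; the --brief and --mime-type options and the - (read from stdin) argument are standard options of file.

```ruby
require "open3"

# Default invocation: print only the MIME type, reading the content from stdin.
FILE_COMMAND = ["file", "--brief", "--mime-type", "-"].freeze

# Hypothetical helper: returns the MIME type reported for the given raw bytes.
# `command` is injectable only so the helper can be exercised without `file`.
def sniff_mime_type(data, command: FILE_COMMAND)
  output, status = Open3.capture2(*command, stdin_data: data, binmode: true)
  raise "#{command.first} exited with status #{status.exitstatus}" unless status.success?

  output.strip
end
```

With an object like the one described above, this would be called as sniff_mime_type(object.io.read), followed by object.io.rewind so the io can be read again later.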
For example, for .wmv files, Marcel, using the io + filename + extension, is able to determine a video/x-ms-wmv content_type, whereas the file command returns video/x-ms-asf, which is a kind of parent of video/x-ms-wmv. Second example: for .avi files, Marcel returns video/vnd.avi whereas the file command returns video/x-msvideo, which is an alias for this content_type.
In both cases, the two content_types are not equal, but each pair could be deemed ‘valid’.
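
To make the idea of ‘valid’ pairs concrete, here is a minimal sketch of a comparison that accepts either an exact match or two types from the same hand-maintained group. The names EQUIVALENT_TYPES and compatible_types? are hypothetical, and the table contains only the two pairs mentioned above, so it is an illustration rather than a usable mapping.

```ruby
# Illustrative only: groups of MIME types treated as interchangeable.
# Both groups come from the examples in this question.
EQUIVALENT_TYPES = [
  %w[video/x-ms-wmv video/x-ms-asf], # specific type and its parent type
  %w[video/vnd.avi video/x-msvideo]  # alias pair
].freeze

# Returns true when the declared and detected types are equal
# or belong to the same group of equivalent types.
def compatible_types?(declared, detected)
  declared = declared.to_s.downcase.strip
  detected = detected.to_s.downcase.strip
  return false if declared.empty? || detected.empty?
  return true if declared == detected

  EQUIVALENT_TYPES.any? do |group|
    group.include?(declared) && group.include?(detected)
  end
end
```

Here compatible_types?("video/x-ms-wmv", "video/x-ms-asf") is accepted, while a pair such as "image/jpg" and "text/plain" is rejected.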
The thing is, doing things this way, I need some kind of mapping of these value pairs. What I am asking SO is: has this content_type mapping already been built somewhere? If not, does anyone know whether it is a complex task? I guess so, since there are thousands of content_types nowadays…
Depending on your answer, I might switch to a less precise method that only compares the top-level type (i.e. image/video/audio/…) rather than the whole MIME type. This might be enough, but I am unsure.
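
For reference, the less precise fallback could look like the following sketch, which compares only the part of each MIME type before the slash. The helper names top_level_type and same_top_level_type? are hypothetical.

```ruby
# Returns the part before the slash, e.g. "video" for "video/vnd.avi",
# or nil when the input is empty.
def top_level_type(mime_type)
  mime_type.to_s.downcase.strip.split("/", 2).first
end

# Returns true when both MIME types share the same non-empty top-level type.
def same_top_level_type?(declared, detected)
  declared_top = top_level_type(declared)
  detected_top = top_level_type(detected)
  return false if declared_top.nil? || declared_top.empty?

  declared_top == detected_top
end
```

This check accepts both example pairs above without any mapping (video vs. video) and still rejects the spoofed .jpg case (image vs. text), but it cannot tell two different formats of the same top-level type apart.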
If anyone has experience with this kind of problem, please let me know.