I have noticed that people where I work are keen on storing information in file names and then parsing those names.
To me this doesn’t seem to be especially good practice. I already see occasional issues with scripts globbing for a file and getting the wrong one because another file matches first. We are also discussing how to get around problems with separators for the fields.
Is it considered bad practice or not?
What are other accepted solutions for retrieving files from a file system based on some type of metadata?
2
Yes I think it’s bad practice. It is subject to all sorts of problems – for example length limits, encoding issues and conflicts due to duplicate data.
It is better to use a “master file” (sometimes called a manifest or index) that contains the metadata and the paths to the files, or something similar in a database, registry or whatnot. Another option is to put the metadata inside the actual files, at the top level of some data structure contained in the file, for example in JSON or XML.
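As a rough sketch of the manifest idea: the JSON layout, the field names (`sensor`, `date`), the paths, and the `find_files` helper below are all made up for illustration and are not any standard format. The point is only that the lookup runs against the metadata, so the file names can be arbitrary.

```python
import json

# Hypothetical manifest contents; in practice this would live in a file
# such as manifest.json next to the data files.
MANIFEST_TEXT = """
[
  {"path": "data/0001.dat", "sensor": "A", "date": "2020-01-01"},
  {"path": "data/0002.dat", "sensor": "B", "date": "2020-01-01"},
  {"path": "data/0003.dat", "sensor": "A", "date": "2020-01-02"}
]
"""


def load_manifest(text):
    """Parse the manifest: a JSON list of objects with a 'path' plus metadata."""
    return json.loads(text)


def find_files(entries, **criteria):
    """Return the paths of all entries whose metadata matches every criterion."""
    return [
        entry["path"]
        for entry in entries
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


entries = load_manifest(MANIFEST_TEXT)
print(find_files(entries, sensor="A", date="2020-01-02"))
```

No globbing is involved, so a second file that happens to match a pattern first cannot be picked up by accident.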
This is somewhat analogous to putting information into keys, or namespacing keys, in key-value stores. I think this is OK as long as you use it only to namespace and do quick lookups; the key components are not there to provide parsable information. If you need that information, duplicate it into the value (the file, in the case above).
5
First, metadata is a blurry concept.
That said, many cases of metadata in file names already exist:
- version numbers of libraries
- date and time of images, or at least sequence index
- file type, which triggers what application should open the file
- name of your home directory, which must be your session username
Nevertheless, that short list is not an argument in favor of the practice.
Alternatives are:
- handle metadata at the file-system level, as Apple’s old HFS did, for instance
- put metadata in the file itself, like Exif for images or ID3 for audio
- put metadata in another file or in a database, like most media managers.
1
It sounds like you need a database.
There are lots of security issues with putting user data in file names. Let’s say that you have a file for each user (“username.txt”). What happens when someone registers the username “../../../../etc/passwd” depends on how you are filtering user input.
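If you do end up building paths from user input, a check along these lines is one way to reject such names. This is a minimal sketch, not a complete defence: `safe_user_file` and the directory layout are hypothetical, and a real system would also want a whitelist of allowed characters.

```python
from pathlib import Path


def safe_user_file(base_dir, username):
    """Build <base_dir>/<username>.txt, refusing names that escape base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." components, so traversal attempts become visible.
    candidate = (base / f"{username}.txt").resolve()
    if base not in candidate.parents:
        raise ValueError(f"illegal username: {username!r}")
    return candidate
```

With this check, `safe_user_file("users", "alice")` returns a path inside `users`, while `safe_user_file("users", "../../../../etc/passwd")` raises `ValueError`.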
Database frameworks will sometimes assist you with sanitizing user input.
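For instance, with Python’s built-in `sqlite3` module the `?` placeholders pass values as data rather than as SQL text, so a hostile username is just an odd string in a column. The table name, column names, and helper functions here are assumptions made up for the example.

```python
import sqlite3


def open_index():
    """Create an in-memory index mapping usernames to opaque file paths."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_files (username TEXT PRIMARY KEY, path TEXT NOT NULL)"
    )
    return conn


def register(conn, username, path):
    # Placeholders keep the username out of both the SQL text and the file name.
    conn.execute(
        "INSERT INTO user_files (username, path) VALUES (?, ?)", (username, path)
    )


def path_for(conn, username):
    """Return the stored path for a username, or None if it is unknown."""
    row = conn.execute(
        "SELECT path FROM user_files WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row else None


conn = open_index()
register(conn, "../../../../etc/passwd", "data/0001.txt")
print(path_for(conn, "../../../../etc/passwd"))
```

The file on disk gets a name the application chooses (`data/0001.txt` in this sketch), so the username never touches the file system.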
3
First, let us agree on what a file is. A file is a named package of data that can be transmitted, received, created and deleted with (very close to) atomic operations.
Many file systems (Mac OS, and more recent Linux file systems) implement “forks”, often used to store resources and metadata. This approach to storing metadata was problematic in that traditional network transfer methods, backup and restore methods and file copying methods were inconsistent, especially when the source and destination file systems understood file forks differently.
The file name is used to hold metadata because a) it is always there, b) metadata has always been present in the file name (at least in the use of file extensions), and c) the file name undergoes very little translation when moving between systems (case distinctions, character set limitations, character limitations aside).
So, the file name is visible, portable, and manageable. This is not a bad thing for storing some metadata.
Probably the best solution for general file metadata is to use a content repository that can be configured with the metadata schema to be used for the files. In many cases this is overkill, but, IMHO, it is the way to go for serious metadata management.
No… well… not necessarily.
So long as you have a strict convention, and common means of parsing and validating it (scripts, libraries, etc.) are readily available, you are good to go.
Take for example packaging and dependency management systems (Maven, NuGet and the like). Though many will use specific metadata files to store the more advanced information, basic information is often part of the file name itself. By relying on strict conventions, the file name can contain the most pertinent information about the package: its vendor, its name, its version, its type. Sometimes that is all you need… 4 or 5 short pieces of information.
If the metadata is simple, then a file naming convention makes perfect sense and requires nothing to be put in place. It can be strengthened with very simple tools and scripts: no database needed, no specialised infrastructure, just a few scripts and a naming convention.
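Here is a minimal sketch of what such a shared parsing and validation script could look like. The convention `<vendor>_<name>_<version>.<type>` and every name in the code are hypothetical; they are not how Maven or NuGet actually name their files. What matters is that one strict pattern is used both to parse and to produce names, and that anything else is rejected instead of guessed at.

```python
import re
from typing import NamedTuple


class PackageName(NamedTuple):
    vendor: str
    name: str
    version: str
    type: str


# The separator "_" is not allowed inside any field, which avoids the
# ambiguity that otherwise makes file names hard to parse.
_PATTERN = re.compile(
    r"(?P<vendor>[a-z0-9]+)_"
    r"(?P<name>[a-z0-9-]+)_"
    r"(?P<version>\d+\.\d+\.\d+)\."
    r"(?P<type>[a-z0-9]+)"
)


def parse_package_filename(filename):
    """Parse a file name, rejecting anything that breaks the convention."""
    match = _PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"file name violates the convention: {filename!r}")
    return PackageName(**match.groupdict())


def format_package_filename(package):
    """Build the file name for a package, the inverse of the parser."""
    return f"{package.vendor}_{package.name}_{package.version}.{package.type}"


print(parse_package_filename("acme_widget-core_1.4.2.jar"))
```

Because matching is done with `fullmatch` on the complete name, a file such as `acme_widget-core_1.4.2.jar.bak` is rejected rather than silently picked up, which is the failure mode you tend to get with loose globbing.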
If nothing out there quite does what you need and your needs are simple, I’d start with this.
Your requirements outgrow this convention? Extend it with a proper metadata file.
You later need better search for this? There are already good solutions out there for searching files that can get you to where you need to be.
It’s not that I dislike databases. Quite the contrary: they are really powerful and useful, but they require some amount of overhead to get going. They need to be installed, backed up and maintained, and you will need staff who, if not completely dedicated, will have to spend part of their time on this infrastructure. They are also more complex and cryptic to the layman; lose the dev who set you up and your system will be stuck in time until you find a replacement.
Never underestimate the power of low tech; with the proper oversight it can get you a long way.
And by the time you outgrow your low tech solution you will have gathered all the experience and requirements to implement the perfect system for your needs.
3
My take on this is that you may have seen some code somewhere that does sloppy or brittle things with file names, but that does not mean that “storing metadata in filenames” is bad in general.
File names are metadata: they are data about the data in the file, independent of the file data itself. In fact, filenames are so old that they are probably the canonical example of metadata.
If you consider that file extensions are just the end part of the filename, then the filename-as-metadata concept becomes even more unavoidable.
I want to rank metadata. Yes, a filename should contain metadata, because it actually is metadata. Or do you refer to files by their inode number?
1. The filename: This is the first information you ever get about the file. Think of it like the title of a book. Thus, in addition to uniquely identifying the file, it should be a short summary of its content. For example, `lenna.bmp` indicates that the file is the face of Lena Forsén, encoded as a Windows bitmap file. A better name would be `lena_with_hat.bmp`, because that is what the picture is, but since `lenna` is an established name, we continue with that name.
2. The file headers: These should contain at least all the information required to process the data, in this case a picture. The header would tell you that the picture is 512×512 pixels, and that each pixel occupies 3 bytes. Here, some information may be implicit from the fact that it is a Windows bitmap file. For example, you need to know which channel is red, green and blue, or whether the channels are stored as separate planes. Since it is a Windows bitmap file, it is known that the data is stored BGR, BGR, BGR …. It should be noted that different applications may require different metadata. For example, to replicate the camera that was used to take the picture, the aperture, film size, and focal length would be needed.
3. The index file: Here, you store everything else that would not fit into the other two categories. It could be additional notes about the picture (location, copyright info), although these could go into (2) as well. It could also contain a list of related pictures.
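To illustrate the second category, the dimensions and pixel size can be read from the header itself instead of being parsed out of the name. This is a rough sketch that assumes the common 40-byte `BITMAPINFOHEADER` layout (or a later variant); `read_bmp_info` is a made-up helper, and the sample header is built by hand for the demonstration rather than read from a real `lenna.bmp`.

```python
import struct


def read_bmp_info(data):
    """Extract width, height and bits per pixel from the start of a BMP file."""
    if len(data) < 30 or data[:2] != b"BM":
        raise ValueError("not a BMP file")
    # The info header starts at byte 14, right after the 14-byte file header.
    (header_size,) = struct.unpack_from("<I", data, 14)
    if header_size < 40:
        raise ValueError("legacy BMP header, not handled in this sketch")
    width, height, _planes, bits_per_pixel = struct.unpack_from("<iiHH", data, 18)
    # A negative height only means the rows are stored top-down.
    return {"width": width, "height": abs(height), "bits_per_pixel": bits_per_pixel}


# Hand-built header for a 512x512 image with 24 bits (3 bytes) per pixel.
sample = (
    b"BM"
    + struct.pack("<IHHI", 54 + 512 * 512 * 3, 0, 0, 54)
    + struct.pack("<IiiHH", 40, 512, 512, 1, 24)
)
print(read_bmp_info(sample))
```

On a real file you would pass in something like the first 30 bytes read in binary mode.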
Notice that sometimes, the filename may contain information found in the file headers. This is not ideal but could be the case when there are multiple versions of the same file. In this particular example, there could be files like
`lenna_512x512.bmp`, `lenna_256x256.bmp`, and `lenna_128x128.bmp`
While these names contain information that is already in the file header, the filenames have to be unique, and they should be descriptive. What makes these files unique? They have different resolutions. Thus, it is not a bad choice to include the pixel dimensions in the filename. As an alternative scheme, you could use the suffixes `_hires`, `_midres`, and `_lowres`, though it would be tricky to come up with names if you have many versions. The ultimate solution here is to store the different versions in the same file, but if for some reason you have to stick to a file format that does not support that, you have to do it like this.