I’m building a proprietary file format for an application I wrote in C# .NET to store save information and, perhaps down the line, project assets. Is there any standard way to do this? I was simply going to serialize
my objects into binary and create a header that would tell me how to parse the file. Is this a bad approach?
The most straightforward method is probably to serialize your structure to XML using the XmlSerializer
class. You probably wouldn’t need to create a separate header and body structure; just serialize all assets into XML. This allows you to easily inspect and edit your file outside of your own program, and it is easy to maintain.
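A minimal sketch of this approach, assuming a hypothetical SaveData class (the class name and fields are made up for illustration):

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical save-data class; any public read/write properties work.
public class SaveData
{
    public string PlayerName { get; set; }
    public int Level { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var serializer = new XmlSerializer(typeof(SaveData));
        var save = new SaveData { PlayerName = "Alice", Level = 3 };

        // Write the object out as XML.
        using (var writer = new StreamWriter("save.xml"))
            serializer.Serialize(writer, save);

        // Read it back in.
        using (var reader = new StreamReader("save.xml"))
        {
            var loaded = (SaveData)serializer.Deserialize(reader);
            Console.WriteLine(loaded.PlayerName); // Alice
        }
    }
}
```

The resulting save.xml is plain text, so you can open it in any editor to inspect or tweak a save.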
However, if your file structure is really complex, containing many different assets of different types, such that serializing the entire structure to one XML document is too burdensome, you might look at serializing each asset separately and compiling them into a single package using the System.IO.Packaging
library in .NET. This is essentially how .docx, .xlsx, .pptx, and the other Office file formats are constructed.
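A sketch of the packaging approach, which writes one asset as a part inside a ZIP-based container (the file extension, part name, and content type below are made up; on .NET Framework this needs a reference to WindowsBase, and on modern .NET the System.IO.Packaging NuGet package):

```csharp
using System;
using System.IO;
using System.IO.Packaging;

public static class Program
{
    public static void Main()
    {
        // Create a ZIP-based package (the same container format .docx uses)
        // and store each asset as a separate part.
        using (var package = Package.Open("project.myapp", FileMode.Create))
        {
            // "/assets/scene.xml" is an illustrative part name.
            var partUri = PackUriHelper.CreatePartUri(
                new Uri("/assets/scene.xml", UriKind.Relative));
            var part = package.CreatePart(partUri, "application/xml");

            using (var stream = part.GetStream())
            using (var writer = new StreamWriter(stream))
                writer.Write("<scene name=\"intro\" />");
        }
    }
}
```

Because the container is a standard ZIP, you can rename the file to .zip and browse the parts with any archive tool, which is handy for debugging.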
From someone who has had to parse a lot of file formats, I have opinions on this from a different point of view than most:
- Make the magic number very unique, so that file-format detectors for other formats don’t misidentify their files as yours. If you use binary, allocate 8 or 16 randomly generated bytes at the start of the format for the magic number. If you use XML, allocate a proper namespace in your domain so that it can’t clash with anyone else’s. If you use JSON, god help you. Maybe someone has sorted out a solution for that abomination of a format by now.
- Plan for backwards compatibility. Store the version number of the format somehow, so that later versions of your software can deal with the differences.
- If the file can be large, or there are sections of it which people might want to skip over for some reason, make sure there is a nice way to do this. XML, JSON and most other text formats are particularly terrible here, because they force the reader to parse all the data between the start and end element even if they don’t care about it. EBML is somewhat better because it stores the length of each element, allowing you to skip straight to its end. If you make a custom binary format, there is a fairly common design where you store a chunk identifier and a length as the first thing in the chunk’s header, and then the reader can skip the entire chunk.
- Store all strings in UTF-8.
- If you care about long-term extensibility, store all integers in a variable-length form.
- Checksums are nice because they allow the reader to abort immediately on invalid data, instead of potentially stepping into sections of the file which could produce confusing results.
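Several of these points can be combined into one sketch of a chunked binary layout: a random magic number, a format version, then length-prefixed chunks that a reader can skip, with variable-length integers and UTF-8 strings. Everything here (the chunk IDs, the magic bytes, the file name) is illustrative, not a real specification; a per-chunk CRC after each payload would cover the checksum point as well.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkFormat
{
    // 8 randomly generated bytes serving as the magic number.
    static readonly byte[] Magic = { 0x9B, 0x42, 0xE1, 0x07, 0x5C, 0xAA, 0x31, 0xF8 };
    const ushort FormatVersion = 1;

    // Write one chunk: 4-byte identifier, varint payload length, payload.
    public static void WriteChunk(BinaryWriter w, string id, byte[] payload)
    {
        w.Write(Encoding.ASCII.GetBytes(id)); // id must be exactly 4 ASCII chars
        WriteVarInt(w, payload.Length);
        w.Write(payload);
    }

    // Variable-length integer: 7 bits per byte, high bit means "more follows".
    public static void WriteVarInt(BinaryWriter w, int value)
    {
        uint v = (uint)value;
        while (v >= 0x80) { w.Write((byte)(v | 0x80)); v >>= 7; }
        w.Write((byte)v);
    }

    public static int ReadVarInt(BinaryReader r)
    {
        int result = 0, shift = 0;
        byte b;
        do { b = r.ReadByte(); result |= (b & 0x7F) << shift; shift += 7; }
        while ((b & 0x80) != 0);
        return result;
    }

    public static void Main()
    {
        using (var w = new BinaryWriter(File.Create("save.dat")))
        {
            w.Write(Magic);
            w.Write(FormatVersion);
            WriteChunk(w, "NAME", Encoding.UTF8.GetBytes("Alice")); // strings as UTF-8
            WriteChunk(w, "LEVL", BitConverter.GetBytes(3));
        }

        using (var r = new BinaryReader(File.OpenRead("save.dat")))
        {
            // A reader that only wants LEVL can skip NAME entirely.
            r.ReadBytes(Magic.Length);
            r.ReadUInt16();
            while (r.BaseStream.Position < r.BaseStream.Length)
            {
                string id = Encoding.ASCII.GetString(r.ReadBytes(4));
                int len = ReadVarInt(r);
                if (id == "LEVL")
                    Console.WriteLine(BitConverter.ToInt32(r.ReadBytes(len), 0));
                else
                    r.BaseStream.Seek(len, SeekOrigin.Current); // skip unwanted chunk
            }
        }
    }
}
```

Because unknown chunk IDs are skipped by length, a version-1 reader can walk a file containing chunks it has never heard of, which is the backwards-compatibility property the list above is after.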
Well, there are times when what you describe can be a very bad approach. This assumes that by ‘serialize’ you mean using a language or framework’s ability to simply take an object and output it directly to some sort of binary stream. The problem is that class structures change over the years. Will you be able to reload a file made in a previous version of your app if all your classes change in a newer one?
For the long-term stability of a file format, I’ve found it better to roll up your sleeves a little now and write your own ‘serializing’/‘streaming’ methods within your classes, i.e., manually handle the writing of values to a stream. Write a header, as you state, that describes the format version, and then the data you want saved, in the order you want it. On the reading side, handling different versions of the file format becomes a lot easier.
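A sketch of such hand-written streaming with a versioned header, assuming a hypothetical SaveData class whose Gold field was added in format version 2 (all names here are invented for illustration):

```csharp
using System;
using System.IO;

public class SaveData
{
    public string PlayerName = "";
    public int Level;
    public int Gold; // added in format version 2

    const int CurrentVersion = 2;

    // Field order is fixed by this method, not by the class layout,
    // so refactoring the class cannot silently change the file format.
    public void Write(BinaryWriter w)
    {
        w.Write(CurrentVersion);
        w.Write(PlayerName);
        w.Write(Level);
        w.Write(Gold);
    }

    public static SaveData Read(BinaryReader r)
    {
        int version = r.ReadInt32();
        return new SaveData
        {
            PlayerName = r.ReadString(),
            Level = r.ReadInt32(),
            // Older files have no Gold field; fall back to a default.
            Gold = version >= 2 ? r.ReadInt32() : 0,
        };
    }
}
```

The version branch in Read is where old files keep loading: each time the format grows, the reader gains one more conditional instead of the whole file becoming unreadable.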
The other option, of course, is XML or JSON. Neither is the greatest for binary-heavy content, but they are simple and human-readable, which is a big plus for long-term viability.
I was simply going to Serialize my objects into binary and create a header that would tell me how to parse the file. Is this a bad approach?
From someone who’s been on the receiving end of someone else doing this …
YES, it’s a Bad Idea.
We had a very old application, written in a now-obsolete technology, that did exactly this: it dumped the object out of memory and wrote it into a file. Easy to code, a nice quick solution for the developers. Two decades and some down the line, when that technology got trashed on security grounds, we were left with thousands of these binary nightmare files lying around, still used by the business, but with no way to edit them.
Picking the file “format” apart and interpreting it into a replacement application was … “Fun”.
I would also love to hear answers to this question from people with years more experience than myself.
I have personally implemented several file formats for my work, and I have moved over to using an XML file format. My requirements and the hardware I interact with change all the time, and there is no telling what I will need to add to the format in the future. One of XML’s primary advantages is that it is semi-structured. For this reason, I generally avoid the automatic XML serialization that .NET provides, because it expects the document to match an exact structure.
My goal was to create an XML format that allows new elements and attributes to be added in the future, and in which the order of the tags does not matter wherever possible. If you are sure that you can load your entire file into memory, then XPath is probably a good choice.
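A small sketch of why XPath suits this kind of tolerant format: a query for an element that is missing, or that has moved, simply returns null rather than failing validation (the project-file structure below is made up):

```csharp
using System;
using System.Xml;

public static class Program
{
    public static void Main()
    {
        // Illustrative project file; element order doesn't matter to the queries below.
        var doc = new XmlDocument();
        doc.LoadXml("<project><settings><grid size=\"16\" /></settings>" +
                    "<assets><asset name=\"tree\" /></assets></project>");

        // Query for a known element; returns the node if present.
        XmlNode grid = doc.SelectSingleNode("/project/settings/grid");
        Console.WriteLine(grid?.Attributes["size"].Value); // 16

        // Query for an element this file doesn't have: null, not an error.
        XmlNode missing = doc.SelectSingleNode("/project/audio");
        Console.WriteLine(missing == null); // True
    }
}
```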
If you are dealing with particularly large files, or for other reasons cannot load the file all at once, then you are probably left with using an XmlReader, scanning for known elements, recursing into those elements with ReadSubtree, and scanning again…
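The streaming scan described above might look like this sketch, which holds only the current element in memory (the element names are illustrative):

```csharp
using System;
using System.IO;
using System.Xml;

public static class Program
{
    public static void Main()
    {
        // Stand-in for a large file on disk; XmlReader.Create also accepts a file path.
        string xml = "<project><asset name=\"tree\" /><asset name=\"rock\" /></project>";
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "asset")
                {
                    // ReadSubtree returns a reader scoped to just this element,
                    // which can be handed to element-specific parsing code.
                    using (var sub = reader.ReadSubtree())
                    {
                        sub.Read();
                        Console.WriteLine(sub.GetAttribute("name"));
                    }
                }
            }
        }
    }
}
```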