For a project, I need to work with varying types of files from some old games and related software–configuration files, saves, resource archives, and so on. The bulk of these aren’t yet documented, nor do tools exist to work with them, so I must reverse-engineer the formats and build my own libraries to handle them.
Although I don’t suppose there’s great demand for most of it, I intend to publish the results of my efforts. Are there any accepted standards for documenting file formats? Looking around, there are several styles in use. Some, like the .ZIP File Format Specification, are very wordy. Others, like those on XentaxWiki, are much more terse, and I find some of them difficult to read. The one I personally like best is this description of the PlayStation 2 Memory Card File System, which includes both detailed descriptive text and several ‘memory maps’ with offsets and such; it also most closely matches my use case. The details will vary a little between formats, but it seems there should be some general principles that I should try to follow.
Edit: I seem not to have explained very well what I want to do. Let me construct an example.
I may have some old piece of software which stores its configuration in a ‘binary’ file–a series of bitfields, integers, strings, and whatnot all glued together and understood by the program, but not human-readable. I decipher this. I wish to document exactly what is the format of this file, in a human-readable way, as a specification for implementing a library to parse and modify this file. Additionally, I’d like this to be easily understood by other people.
There are several ways such a document might be written. The PKZIP example above is very wordy and mostly describes the file format in free text. The PS2 example gives tables of value types, offsets, and sizes, with extensive comments on what they all mean. Many others, like those on XentaxWiki, only list the variable types and sizes, with little or no commentary.
I ask whether there is any standard, akin to a coding style guide, which provides guidance on how to write this kind of documentation. If not, is there any well-known excellent example that I should emulate? If not, can anyone at least summarize some useful advice?
A binary file is just a sequence of bits arranged into logical units according to certain rules. These rules are usually called a grammar. Grammars can be classified into four types (the Chomsky hierarchy), and for context-free grammars you can use Extended Backus-Naur Form, as Matt Fenwick pointed out in his comment. The interpretation (or semantics) of the sequence stored in the file can be described verbally or with well-annotated sample programs that serialize and deserialize the information.
To learn more about documenting binary file formats, I suggest reading up on, for example, the ASN.1 standard.
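As a rough illustration of what such an annotated sample program might look like, here is a minimal Python sketch. The layout it documents (the `CFG1` magic, the version field, the flag bits, the length-prefixed name) is entirely made up for the example and does not correspond to any real game or tool; the point is only that the offset table in the comment and the code describe the same bytes.

```python
import struct

# Hypothetical layout, invented for illustration (integers are little-endian):
#   offset  size  type    field
#   0       4     bytes   magic, always b"CFG1"
#   4       2     uint16  version
#   6       1     uint8   flags (bit 0 = fullscreen, bit 1 = sound)
#   7       1     uint8   name_len
#   8       n     bytes   player name, ASCII, n = name_len
HEADER = struct.Struct("<4sHBB")  # fixed 8-byte header
MAGIC = b"CFG1"


def serialize(version, fullscreen, sound, name):
    """Build the byte sequence for one configuration record."""
    raw_name = name.encode("ascii")
    flags = (1 if fullscreen else 0) | ((1 if sound else 0) << 1)
    return HEADER.pack(MAGIC, version, flags, len(raw_name)) + raw_name


def deserialize(data):
    """Parse a record produced by serialize() back into a dict."""
    magic, version, flags, name_len = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    start = HEADER.size
    raw_name = data[start:start + name_len]
    if len(raw_name) != name_len:
        raise ValueError("truncated name")
    return {
        "version": version,
        "fullscreen": bool(flags & 0x01),
        "sound": bool(flags & 0x02),
        "name": raw_name.decode("ascii"),
    }
```

A round trip such as `deserialize(serialize(3, True, False, "ALEX"))` should give back the original values, which makes the sample double as a quick sanity check on the written description.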
That’s odd, because a quick search for file formats brought up a Wikipedia article (List of file formats). It also includes several Video Game Data formats.
List of common file formats of data for video games on systems that support filesystems, most commonly PC games.
It also includes a large selection of Video Game Storage Media formats.
List of the most common filename extensions used when a game’s ROM image or storage medium is copied from an original ROM device to an external memory such as hard disk for back up purposes or for making the game playable with an emulator. In the case of cartridge-based software, if the platform specific extension is not used then filename extensions “.rom” or “.bin” are usually used to clarify that the file contains a copy of a content of a ROM. ROM, disk or tape images usually do not consist of a single file or ROM, rather an entire file or ROM structure contained within a single file on the backup medium.
Are there any accepted standards for documenting file formats?
There is no “official” standard anywhere. Since the file formats are made by a company, the company decides on the format for the documentation.