I’ve recently read three separate books on algorithms and data structures, TCP/IP socket programming, and programming with memory. The book about memory briefly discussed serializing data structures in order to store them on disk or send them across a network. I can’t help but wonder why the other two books didn’t discuss serialization at all.
After an unsuccessful web and book search, I’m left wondering: where can I find a good book, paper, or tutorial on serializing data structures in C? Where or how did you learn it?
C has no native support for serializing structures, so you’re on your own.
The first-order approximation is (as stated in other replies) to define serialization
for primitive types, and apply it recursively to larger structures.
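As a rough sketch of that recursive idea: one function handles a primitive type, and each larger structure is serialized by calling the functions for its members. The `point` and `rect` types, the `put_*` names, and the choice of big-endian output are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example types: a rect is built from two points. */
struct point { int32_t x, y; };
struct rect  { struct point top_left, bottom_right; };

/* Primitive case: write a 32-bit value as 4 bytes, most significant first.
   Returns the position just past the written bytes. */
static uint8_t *put_u32(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
    return out + 4;
}

/* A point is serialized as its two integer members, in order. */
static uint8_t *put_point(uint8_t *out, const struct point *p)
{
    out = put_u32(out, (uint32_t)p->x);
    return put_u32(out, (uint32_t)p->y);
}

/* A rect is serialized as its two points, in order. */
static uint8_t *put_rect(uint8_t *out, const struct rect *r)
{
    out = put_point(out, &r->top_left);
    return put_point(out, &r->bottom_right);
}
```

Calling `put_rect(buf, &r)` on a 16-byte buffer fills it completely, and the matching reader would mirror the same call structure. The caller is responsible for making the buffer large enough.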
However, there are lots of devilish details that have to be addressed beyond the
simple concept. To name a few:
- Endian order of integers, and the sizes of common integer types, which depend on the machine architecture. This isn’t much of a problem if all the consumers of serialization are the same binary, but consider reading data produced by a 32-bit PPC Mac on a 64-bit Windows machine, or whether a “long” is 32 or 64 bits.
- Different representations for common data types. Color bitmaps have 3 components on a PC, but 4 components, in a different order, on Macs.
- Representation and precision of floating point numbers.
- Whether strings with the same letters are identical or only similar.
- Dealing with cyclic or self-referential data structures.
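For the last point, one common trick (certainly not the only one) is to replace each pointer with an index before writing, and to rebuild the pointers after reading. The sketch below assumes all nodes live in a single array; `node`, `node_record`, and both function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type; nodes live in one array so a pointer can be
   turned into an array index. */
struct node {
    int32_t      value;
    struct node *next;   /* may point back to an earlier node (a cycle) */
};

/* Serializable form: the pointer is replaced by an index, -1 meaning NULL. */
struct node_record {
    int32_t value;
    int32_t next_index;
};

/* Convert n nodes from the array `nodes` into pointer-free records. */
static void nodes_to_records(const struct node *nodes, size_t n,
                             struct node_record *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].value = nodes[i].value;
        out[i].next_index =
            nodes[i].next ? (int32_t)(nodes[i].next - nodes) : -1;
    }
}

/* Rebuild n nodes, including any cycles, from their records. */
static void records_to_nodes(const struct node_record *in, size_t n,
                             struct node *nodes)
{
    for (size_t i = 0; i < n; i++) {
        nodes[i].value = in[i].value;
        nodes[i].next =
            in[i].next_index >= 0 ? &nodes[in[i].next_index] : NULL;
    }
}
```

The records contain only integers, so they can then be written field by field with a fixed size and byte order. A real reader should also check that every index is within range before trusting it.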
Have a look at the work Google has done with Protocol Buffers.
You write a .proto file like this:
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
Then you compile it with protoc, the protocol buffer compiler, to produce code in C++, Java, or Python.
Then, if you are using C++, you use that code like this:
Person person;
person.set_id(123);
person.set_name("Bob");
person.set_email("[email protected]");
fstream out("person.pb", ios::out | ios::binary | ios::trunc);
person.SerializeToOstream(&out);
out.close();
You can examine the SerializeToOstream method to understand how Google generates the serialization code. Yes, it’s C++ code, but it should still be pretty close to C code.
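To give a feel for what ends up on the wire, here is a hand-written C sketch of how Protocol Buffers encodes a non-negative integer field such as `id`: a key byte built from the field number and wire type, followed by the value as a base-128 varint. This is not the generated code, and `put_varint` and `put_varint_field` are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Write v as a base-128 varint: 7 bits per byte, least significant group
   first, high bit set on every byte except the last. Returns bytes written. */
static size_t put_varint(uint8_t *out, uint64_t v)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Write one varint-typed field: a key of (field_number << 3) | wire_type,
   with wire type 0 for varints, followed by the value itself. */
static size_t put_varint_field(uint8_t *out, uint32_t field_number, uint64_t v)
{
    size_t n = put_varint(out, ((uint64_t)field_number << 3) | 0);
    return n + put_varint(out + n, v);
}
```

With this sketch, `put_varint_field(buf, 1, 123)` writes the two bytes `0x08 0x7B`, which is how `id = 123` from the message above would appear. Strings and nested messages use a different wire type with a length prefix, which is not shown here.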
It probably wasn’t covered in your books because there are so many possible variations, all different trade-offs between runtime speed, ease of use, and portability.
For example, 1 and 3 are pretty much opposite ends of the spectrum, and 2 just illustrates the number of small variations possible between them:
1. Don’t serialize, just copy the raw bytes
   - this leaves you dependent on architecture and compiler details (endianness, padding & alignment, floating-point representation), so is very non-portable
   - it does no (extra) work at all, so is very fast
   - it doesn’t handle any form of indirection, so only works for flat, self-contained POD structures
   - it’s an absolute pain to decode, if you need it for debugging
2. As 1, but explicitly specify padding, alignment and endianness
   - this is now portable, at the fairly minimal extra cost of byte-order conversions (on hosts with different native byte order to that specified) and the requirement to force appropriate alignment (which may not be optimal on all architectures)
3. Serialize everything to a text format like XML or JSON
   - this is very portable
   - and very slow (relatively)
   - you can make it handle indirection, including circular references, if you need
   - it’s really easy to read serialized messages directly
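To make the two ends of that spectrum concrete, here is a small sketch that saves the same record as raw bytes and as JSON-style text. The `sample` struct and the `save_raw`, `save_text`, and `load_text` helpers are hypothetical, and the text format is just enough for this one struct, not a general JSON writer or parser.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record; the compiler is free to insert padding after tag. */
struct sample {
    uint8_t  tag;
    uint32_t value;
};

/* Raw copy: the in-memory bytes, padding and native byte order included. */
static size_t save_raw(const struct sample *s, unsigned char *out)
{
    memcpy(out, s, sizeof *s);
    return sizeof *s;
}

/* Text: write a small JSON-style record. Returns its length,
   or 0 if the buffer is too small. */
static size_t save_text(const struct sample *s, char *out, size_t cap)
{
    int n = snprintf(out, cap, "{\"tag\":%u,\"value\":%lu}",
                     (unsigned)s->tag, (unsigned long)s->value);
    return (n < 0 || (size_t)n >= cap) ? 0 : (size_t)n;
}

/* Parse the text written by save_text. Returns 1 on success, 0 otherwise. */
static int load_text(const char *in, struct sample *s)
{
    unsigned tag;
    unsigned long value;
    if (sscanf(in, "{\"tag\":%u,\"value\":%lu}", &tag, &value) != 2)
        return 0;
    s->tag = (uint8_t)tag;
    s->value = (uint32_t)value;
    return 1;
}
```

The raw form is a single `memcpy`, but its size and contents depend on the compiler and the host. The text form costs formatting and parsing, yet reads the same everywhere and can be inspected by eye.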
Oh, and I learnt about this stuff (all of the above, plus CORBA, ASN.1 and Protocol Buffers) by implementing existing documented protocols.
If your main focus is on portability rather than speed, something towards the #3 end of the spectrum is probably a better fit. As for how you build that in the first place, it’s almost more a question about reflection.
The Wikipedia article Serialization covers the topic fairly well, though oddly it does not mention ASN.1, which is a widely hated but extremely well defined and well known standard for describing efficient data serialization protocols. ASN.1 compilers typically generate code (e.g. C code) for encoding and decoding the described data structures in a canonical way.
BTW, the endian issue can be trivially dealt with in C, as Rob Pike has shown nicely in his article The Byte Order Fallacy, though some C compilers don’t yet always generate optimal object code when using this technique.
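A sketch of that technique, as I understand it: build the value from individual bytes with shifts, so the code depends only on the byte order of the data and never on the byte order of the host. The function names are my own, and the casts to `uint32_t` are added so the shifts never touch a sign bit.

```c
#include <stdint.h>

/* Read a 32-bit value stored least significant byte first. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Read a 32-bit value stored most significant byte first. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}

/* Write a 32-bit value least significant byte first. */
static void store_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}
```

There is no `#ifdef` on the host byte order and no pointer cast, so the same source gives the same result on little-endian and big-endian machines.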
Serialization is usually pretty simple and recursive. You just figure out what fields you need to send before the actual data so you can reconstruct the structure at the other end. The main issue is endianness. Learn it by trial and error; I doubt you need a book for this sort of thing.
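For example, a string is usually sent with its length first, so the reader knows how many bytes to expect. Here is a minimal sketch, assuming a 2-byte big-endian length prefix; `put_string` and `get_string` are names picked for this example, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 2-byte big-endian length followed by the bytes of the string.
   Returns the bytes written, or 0 if the string or buffer does not fit. */
static size_t put_string(uint8_t *out, size_t cap, const char *s)
{
    size_t len = strlen(s);
    if (len > 0xFFFF || cap < len + 2)
        return 0;
    out[0] = (uint8_t)(len >> 8);
    out[1] = (uint8_t)len;
    memcpy(out + 2, s, len);
    return len + 2;
}

/* Read a string written by put_string into dst and NUL-terminate it.
   Returns bytes consumed, or 0 on truncated input or a too-small dst. */
static size_t get_string(const uint8_t *in, size_t avail,
                         char *dst, size_t dst_cap)
{
    if (avail < 2)
        return 0;
    size_t len = ((size_t)in[0] << 8) | in[1];
    if (avail < len + 2 || dst_cap < len + 1)
        return 0;
    memcpy(dst, in + 2, len);
    dst[len] = '\0';
    return len + 2;
}
```

The reader checks the length against the bytes actually available before copying, which matters as soon as the data comes from a network or an untrusted file.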