I would like to write code that extracts a file from a ZIP file.
Usually I would walk through the file expecting a local file header or a central directory start/end signature. The local file header has a length of 30 + file name length + extra field length bytes, where file name length and extra field length are 16-bit numbers stored at positions 26 and 28 of the header. The file data starts after the header and is compressed size bytes long, where compressed size is stored as a 32-bit number at position 18 of the header. I would expect the next signature after the file data.
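For illustration, the header walk described above could look roughly like this in Python. This is only a sketch: the names `parse_local_header`, `LOCAL_HEADER` and the returned dictionary keys are made up for this example, and it assumes the whole archive is in memory and that the sizes in the header are valid (no bit 3, no zip64).

```python
import struct

LOCAL_HEADER_SIG = b"PK\x03\x04"
# signature, version needed, flags, method, mod time, mod date,
# crc32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30 bytes in total


def parse_local_header(buf, offset=0):
    """Hypothetical helper: decode one local file header found at `offset`."""
    (sig, _version, flags, method, _time, _date,
     _crc, comp_size, _uncomp_size, name_len, extra_len) = \
        LOCAL_HEADER.unpack_from(buf, offset)
    if sig != LOCAL_HEADER_SIG:
        raise ValueError("not a local file header")
    # File data begins after the fixed part, the file name and the extra field.
    data_start = offset + LOCAL_HEADER.size + name_len + extra_len
    return {
        "flags": flags,
        "method": method,
        "compressed_size": comp_size,
        "data_start": data_start,
        # Only meaningful when compressed_size in the header is trustworthy.
        "next_offset": data_start + comp_size,
    }
```

With valid sizes, `next_offset` is where the next signature is expected.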
However, there are some ZIP files where compressed size is 0. These are usually files that are generated as a stream, so when the local file header is written, the compressed size is not known yet. According to the specification, such files are indicated by bit 3 of the local file header general purpose bit flag being set. For such files, the compressed content is followed by a data descriptor, which contains the checksum and size of the file. The next local file header or the central directory would only follow after that data descriptor.
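As a rough sketch of that layout in Python: the helper names below are invented for this example, and the code assumes a non-zip64 entry, where the descriptor holds three 32-bit values. The 4-byte descriptor signature is commonly present but optional, so the sketch accepts both forms.

```python
import struct

FLAG_DATA_DESCRIPTOR = 0x0008  # bit 3 of the general purpose bit flag
DESCRIPTOR_SIG = b"PK\x07\x08"  # commonly written, but optional


def has_data_descriptor(flags):
    """True if bit 3 is set, i.e. sizes follow the data instead of preceding it."""
    return bool(flags & FLAG_DATA_DESCRIPTOR)


def parse_data_descriptor(buf, offset):
    """Hypothetical helper: return (crc32, compressed_size, uncompressed_size,
    descriptor_length). Assumes 32-bit sizes (not zip64)."""
    sig_len = 0
    if buf[offset:offset + 4] == DESCRIPTOR_SIG:
        sig_len = 4
    crc, comp_size, uncomp_size = struct.unpack_from("<III", buf, offset + sig_len)
    return crc, comp_size, uncomp_size, sig_len + 12
```

Note that this only helps once the end of the compressed data is already known, since `offset` must point at the descriptor.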
I’m having trouble understanding how to detect the position of the data descriptor so that I know where to start looking for the next signature. I am really confused why this does not seem to be explained anywhere, as it seems to be a really obvious question.
It seems like with compression method 8 (deflate), the compressed data is stored in blocks, and there is a marker that indicates the last block, so the compression method itself provides a way to detect the end of the compressed data. However, this is not the case for all compression methods. In particular, compression method 0 (no compression) does not seem to provide such a way because the uncompressed data is stored as is. I also cannot just look for the next signature in the data, because the uncompressed data of the file itself might contain a signature. Nor can I wait for the central directory and look up the file size there, since I don’t know how to find the position of the central directory.
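For the deflate case, one way to see the self-terminating property in practice is with Python's `zlib` module. In this sketch the function name `inflate_until_end` is made up, and the whole remaining archive is assumed to be in memory. The decompressor stops at the final block and leaves everything after it in `unused_data`, which gives the compressed size.

```python
import zlib


def inflate_until_end(buf, data_start):
    """Hypothetical helper: inflate the raw deflate stream that begins at
    `data_start` and report how many compressed bytes it occupied."""
    # Negative wbits selects raw deflate (no zlib header), as used in ZIP.
    d = zlib.decompressobj(-15)
    out = d.decompress(buf[data_start:])
    if not d.eof:
        raise ValueError("deflate stream is truncated")
    # Bytes after the final deflate block were not consumed.
    compressed_size = len(buf) - data_start - len(d.unused_data)
    return out, compressed_size
```

The data descriptor, if any, would then start at `data_start + compressed_size`.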
How can I parse a file that has bit 3 set and thus does not contain the compressed file sizes?
"since I don’t know how to find the position of the central directory."
Then learn how. You search from the end of the zip file for the end-of-central-directory record. That will either give you the offset of the central directory, or if that offset exceeds 2^32-2, then the end record 32-bit offset will contain 2^32-1, and will be immediately preceded by a zip64 end-of-central-directory-locator. That will have the 64-bit offset of the zip64 end-of-central-directory record, which will have the 64-bit offset of the central directory.
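A minimal sketch of that backward search in Python follows. The function name `find_central_directory` is invented for this example, and it makes two simplifying assumptions: the whole file is in memory, and nothing follows the end record except its own comment, so the comment length field can be used to reject false matches.

```python
import struct

EOCD = struct.Struct("<4sHHHHIIH")          # end-of-central-directory, 22 bytes
ZIP64_LOCATOR = struct.Struct("<4sIQI")     # zip64 EOCD locator, 20 bytes
ZIP64_EOCD = struct.Struct("<4sQHHIIQQQQ")  # fixed part of zip64 EOCD, 56 bytes


def find_central_directory(buf):
    """Hypothetical helper: return the offset of the central directory."""
    if len(buf) < EOCD.size:
        raise ValueError("too short to be a ZIP file")
    # The record is 22 bytes plus a comment of at most 65535 bytes.
    lowest = max(0, len(buf) - EOCD.size - 0xFFFF)
    end = len(buf) - EOCD.size + 4
    while True:
        pos = buf.rfind(b"PK\x05\x06", lowest, end)
        if pos == -1:
            raise ValueError("end-of-central-directory record not found")
        *_, cd_offset, comment_len = EOCD.unpack_from(buf, pos)
        # Assumption: the record plus its comment reaches exactly to EOF.
        if pos + EOCD.size + comment_len == len(buf):
            break
        end = pos + 3  # keep searching strictly before this match
    if cd_offset != 0xFFFFFFFF:
        return cd_offset
    # 2^32-1 means the real offset lives in the zip64 structures.
    loc_pos = pos - ZIP64_LOCATOR.size
    if loc_pos < 0:
        raise ValueError("zip64 locator missing")
    sig, _disk, zip64_eocd_pos, _disks = ZIP64_LOCATOR.unpack_from(buf, loc_pos)
    if sig != b"PK\x06\x07":
        raise ValueError("zip64 locator missing")
    fields = ZIP64_EOCD.unpack_from(buf, zip64_eocd_pos)
    if fields[0] != b"PK\x06\x06":
        raise ValueError("zip64 end-of-central-directory record missing")
    return fields[-1]  # 64-bit offset of the central directory
```

The returned offset should point at the first central directory header, which starts with the signature `PK\x01\x02`.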
The central directory will contain all of the lengths.
There should not be a zip file that sets the flag bit 3 for a compressed data format that is not self-terminating, e.g. stored. You should be able to read the zip file as a stream and not have to get lengths from the central directory.