I am trying to read a PDF file in binary mode and parse its contents.
I can parse most objects, including streams encoded with /FlateDecode, which I decompress with zlib.
But when I decompress the cross-reference (xref) stream, I cannot make sense of the resulting bytes.
I am using this PDF (dplyr.pdf) as a test file.
The xref stream object starts at byte offset 343543; its stream has a length of 5906 and uses /Filter /FlateDecode.
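The question does not say how the offset 343543 was found. It is presumably the number that follows the last startxref keyword near the end of the file. A minimal sketch of reading that value, where `find_startxref` is a hypothetical helper and the demo bytes are synthetic rather than taken from dplyr.pdf:

```python
import re

def find_startxref(data: bytes) -> int:
    """Return the integer that follows the last 'startxref' keyword."""
    # The keyword sits close to the end of the file, so only scan the tail.
    tail = data[-2048:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found")
    return int(matches[-1])

# Synthetic file tail for demonstration only
demo = b"%PDF-1.5\n...objects...\nstartxref\n343543\n%%EOF\n"
print(find_startxref(demo))  # 343543
```

With the real file, `find_startxref(data)` could be compared against the hard-coded 343543.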
When I decompress the stream, I get the following bytes (only the first 100 are shown; the total length is 12625):
b'\x00\x00\x00\x00\xff\x02\x00\x00\x02\x00\x01\x00\x00\x0f\x00\x02\x00\x01\x98\x0c\x02\x00\t;\x10\x02\x00\x00\x02\x01\x02\x00\x00\x02\x02\x02\x00\x01\x98-\x02\x00\t;\x0f\x02\x00\x00\x02\x03\x02\x00\x00\x02\x04\x02\x00\x01\x98>\x02\x00\t;\x0e\x02\x00\x00\x02\x05\x02\x00\x00\x02\x06\x02\x00\x01\x98V\x02\x00\t;\r\x02\x00\x00\x02\x07\x02\x00\x00\x02\x08\x02\x00\x01\x98['
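One reading worth checking: the PDF specification stores a cross-reference stream as fixed-width binary rows, with the field widths given by the /W array in the stream dictionary. The sketch below splits bytes into such rows. It assumes /W [1 3 1] and no /Predictor; neither value is shown in the question, and the widths are only a guess motivated by 12625 = 2525 × 5. `split_xref_rows` is a hypothetical helper, and the sample is the first 15 bytes of the dump above.

```python
def split_xref_rows(stream: bytes, widths=(1, 3, 1)):
    """Split decoded xref-stream bytes into tuples of big-endian integers."""
    row_len = sum(widths)  # assumed /W [1 3 1] gives 5 bytes per row
    if len(stream) % row_len:
        raise ValueError("stream length is not a multiple of the row length")
    rows = []
    for start in range(0, len(stream), row_len):
        pos = start
        fields = []
        for w in widths:
            # Each field is an unsigned big-endian integer of w bytes.
            fields.append(int.from_bytes(stream[pos:pos + w], "big"))
            pos += w
        rows.append(tuple(fields))
    return rows

# First 15 bytes of the decompressed data, written with escape sequences
sample = b"\x00\x00\x00\x00\xff\x02\x00\x00\x02\x00\x01\x00\x00\x0f\x00"
for row in split_xref_rows(sample):
    print(row)
# (0, 0, 255)
# (2, 2, 0)
# (1, 15, 0)
```

In the specification, the first field is the entry type: 0 is a free object (next free object number, generation), 1 is an uncompressed object (byte offset, generation), and 2 is an object stored inside an object stream (object stream number, index within it). Whether this matches the file depends on the actual /W and /DecodeParms entries.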
As a benchmark, I cleaned the PDF with mutool. The cleaned and decompressed file contains this xref table:
xref
0 2525
0000000002 00256 f
0000000016 00000 n
0000000203 00001 f
0000000069 00000 n
0000000134 00000 n
0000000215 00000 n
0000000262 00000 n
0000000321 00000 n
MWE
To reproduce the values, I use this Python code
import zlib

file = "pdf/dplyr.pdf"
with open(file, "rb") as f:
    data = f.read()

# bytes of the whole xref stream object (dictionary plus stream)
xref_text = data[343543:349686]
# compressed stream payload inside that object
stream = xref_text[220:6127]
stream = zlib.decompress(stream)
stream[:100]
and the cleaned file was produced with
mutool clean -d dplyr.pdf dplyr_clean.pdf
where the xref table shown above starts at line 37521.
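The hard-coded slice positions 220 and 6127 in the Python code could instead be derived from the object itself. A minimal sketch, assuming /Length is a direct integer in the dictionary (not an indirect reference) and that the data starts right after the line break following the stream keyword; `extract_stream` is a hypothetical helper and the demo object is synthetic:

```python
import re
import zlib

def extract_stream(obj_bytes: bytes) -> bytes:
    """Return the raw bytes between 'stream' and 'endstream' of one object."""
    length_match = re.search(rb"/Length\s+(\d+)", obj_bytes)
    keyword_match = re.search(rb"stream\r?\n", obj_bytes)
    if length_match is None or keyword_match is None:
        raise ValueError("not a stream object with a direct /Length")
    length = int(length_match.group(1))
    start = keyword_match.end()  # first byte of the encoded data
    return obj_bytes[start:start + length]

# Synthetic stream object for demonstration only
payload = zlib.compress(b"\x00\x00\x00\x00\xff" * 3)
obj = (
    b"1 0 obj\n<< /Type /XRef /Length %d /Filter /FlateDecode >>\nstream\n"
    % len(payload)
    + payload
    + b"\nendstream\nendobj\n"
)
decoded = zlib.decompress(extract_stream(obj))
print(len(decoded))  # 15
```

Applied to the real file, `extract_stream(data[343543:349686])` would replace the manual slice, provided the assumptions above hold for that object.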