I have a PDF document with several pages, each containing text and an image, and I need to extract the images. I am using Python with the fitz (PyMuPDF) library. When I try to extract an image from '/XObject', I only get a 1-pixel image, which is apparently used as a mask; the actual image cannot be extracted. The page object is as follows: 1 0 obj <</Tabs/S/Group<</S/Transparency/Type/Group/CS/DeviceRGB>>/Contents 6 0 R/Type/Page/Resources<</ColorSpace<</CS/DeviceRGB>>/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]/Font<</F1 2 0 R/F2 5 0 R>>/XObject<</Xf1 3 0 R/img0 4 0 R>>>>/Parent 7 0 R/MediaBox[0 0 99 173]>>
The objects referenced in /XObject (stream data cut out to save space):
3 0 objn<</Subtype/Form/Filter/FlateDecode/Type/XObject/Matrix [1 0 0 1 0 0]/FormType 1/Resources<</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]>>/BBox[0 0 61.33 61.33]/Length 11510>>stream(...)endstreamnendobjn 4 0 objn<</ColorSpace[/Indexed/DeviceRGB 255(x00x00x00x80x00x00x00x80x00x80x80x00x00x00x80x80x00x80x00x80x80x80x80x80xfcx04x04x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00
x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00xc0xc0xc0xffx00x00x00xffx00xffxffx00x00x00xffxffx00xffx00xffxffxffxffxff)]/Mask [8 8 ]/Subtype/Image/Height 1/Filter/FlateDecode/Type/XObject/Width 1/Length 9/BitsPerComponent 8>>streamnxx9cxe3x00x00x00tx00tnendstreamnendobj
The method (page.Resources.XObject) outputs the following result: {'Xf1': <Stream:len=11510,data=b'xx9cxc5|xcbx0e-;xb2xd4|x7fxc5xf9x02TxcbNxbbxbfx80xc4x80x11 …'>, 'img0': <Stream:len=9,data=b'xx9cxe3x00x00x00tx00t'>}
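For reference, the two entries differ in kind, which can be read straight off the dictionaries quoted above. The sketch below uses only the standard library; `describe_xobject` and the two abbreviated strings are illustrative names introduced here (palette bytes, matrix, resources and stream data are left out), not part of fitz.

```python
import re

# Abbreviated copies of the two dictionaries quoted above
# (palette bytes, /Matrix, /Resources and stream data left out).
XF1 = "<</Subtype/Form/Filter/FlateDecode/Type/XObject/FormType 1/BBox[0 0 61.33 61.33]/Length 11510>>"
IMG0 = "<</Mask [8 8 ]/Subtype/Image/Height 1/Filter/FlateDecode/Type/XObject/Width 1/Length 9/BitsPerComponent 8>>"


def describe_xobject(source):
    """Return (subtype, width, height) parsed from a PDF dictionary string."""
    subtype = re.search(r"/Subtype\s*/(\w+)", source)
    width = re.search(r"/Width\s+(\d+)", source)
    height = re.search(r"/Height\s+(\d+)", source)
    return (
        subtype.group(1) if subtype else None,
        int(width.group(1)) if width else None,
        int(height.group(1)) if height else None,
    )


print(describe_xobject(XF1))   # ('Form', None, None)
print(describe_xobject(IMG0))  # ('Image', 1, 1)
```

So 'img0' is the only image XObject on the page and it is 1 x 1 pixel, while 'Xf1' is a Form XObject, i.e. a stream of drawing operators rather than an image stream.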
How do I extract the image contained in 'Xf1'?
```python
import io

import fitz  # PyMuPDF
from PIL import Image

pdf_path = "1.pdf"
pdf_document = fitz.open(pdf_path)

for page_num in range(pdf_document.page_count):
    page = pdf_document[page_num]

    # Print the plain text of the page.
    text = page.get_text("text")
    print(f"text {page_num + 1}:\n{text}")

    # List the image XObjects referenced by the page and save each one.
    images = page.get_images(full=True)
    for img_index, img_info in enumerate(images):
        xref = img_info[0]
        base_image = pdf_document.extract_image(xref)
        image_bytes = base_image["image"]
        image = Image.open(io.BytesIO(image_bytes))
        # JPEG cannot store alpha or palette modes, so convert those to RGB.
        if image.mode in ("RGBA", "LA", "P"):
            image = image.convert("RGB")
        image.save(f"page_{page_num}_image_{img_index}.jpg")

pdf_document.close()
```