I have a PDF document with several pages, each containing text and an image. I need to extract the image. I am using Python and the fitz library (PyMuPDF). When I try to extract images from `/XObject`, all I get is a 1-pixel image, apparently used as a mask, and I cannot extract the actual image. The page object looks like this:
1 0 obj
<</Tabs/S/Group<</S/Transparency/Type/Group/CS/DeviceRGB>>/Contents 6 0 R/Type/Page/Resources<</ColorSpace<</CS/DeviceRGB>>/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]/Font<</F1 2 0 R/F2 5 0 R>>/XObject<</Xf1 3 0 R/img0 4 0 R>>>>/Parent 7 0 R/MediaBox[0 0 99 173]>>
The objects referenced in /XObject (I cut out the stream contents to save space):
3 0 obj\n<</Subtype/Form/Filter/FlateDecode/Type/XObject/Matrix [1 0 0 1 0 0]/FormType 1/Resources<</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]>>/BBox[0 0 61.33 61.33]/Length 11510>>stream(…)endstream\nendobj\n
4 0 objn<</ColorSpace[/Indexed/DeviceRGB 255(x00x00x00x80x00x00x00x80x00x80x80x00x00x00x80x80x00x80x00x80x80x80x80x80xfcx04x04x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x0
0x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00x00xc0xc0xc0xffx00x00x00xffx00xffxffx00x00x00xffxffx00xffx00xffxffxffxffxff)]/Mask [8 8 ]/Subtype/Image/Height 1/Filter/FlateDecode/Type/XObject/Width 1/Length 9/BitsPerComponent 8>>streamnxx9cxe3x00x00x00tx00tnendstreamnendobj
`page.Resources.XObject` returns the following:
{'Xf1': <Stream:len=11510,data=b'x\x9c\xc5|\xcb\x0e-;\xb2\xd4|\x7f\xc5\xf9\x02T\xcbN\xbb\\\xbf\x80\xc4\x80\x11 …'>,
'img0': <Stream:len=9,data=b'x\x9c\xe3\x00\x00\x00\t\x00\t'>}
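To double-check that `img0` really is only a one-pixel placeholder, its 9-byte stream can be inflated with the standard library. This is a minimal sketch, assuming the escape backslashes were lost when the dump was pasted (so the stream starts with the zlib header `x\x9c`); the names `IMG0_STREAM`, `PALETTE_HEAD`, `MASK_RANGE` and `decode_single_pixel` are only illustrative.

```python
import zlib

# Stream of object 4 (img0) as printed above, assuming the escape
# backslashes were stripped from the dump: b"x\x9c" is a zlib header.
IMG0_STREAM = b"x\x9c\xe3\x00\x00\x00\t\x00\t"

# First nine RGB entries of the /Indexed /DeviceRGB palette from object 4.
PALETTE_HEAD = bytes.fromhex(
    "000000 800000 008000 808000 000080 800080 008080 808080 fc0404"
)

# /Mask [8 8]: colour-key range, given as palette indices.
MASK_RANGE = (8, 8)


def decode_single_pixel(stream: bytes) -> int:
    """Inflate a 1x1, 8-bit indexed image stream and return its palette index."""
    samples = zlib.decompress(stream)
    if len(samples) != 1:
        raise ValueError(f"expected exactly 1 sample, got {len(samples)}")
    return samples[0]


index = decode_single_pixel(IMG0_STREAM)
rgb = tuple(PALETTE_HEAD[3 * index : 3 * index + 3])
masked = MASK_RANGE[0] <= index <= MASK_RANGE[1]
print(index, rgb, masked)  # 8 (252, 4, 4) True
```

The only sample is palette index 8, which falls inside the `/Mask [8 8]` range, so with colour-key masking that pixel is not painted at all. This suggests `img0` is a transparent 1x1 placeholder and that whatever is visible on the page is drawn by `Xf1`.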
How can I extract the image contained in `Xf1`?
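Since `Xf1` has `/Subtype/Form`, it is a Form XObject, that is, a reusable content stream rather than an image object, which would explain why it never shows up among the extracted images. Below is a rough sketch for checking what such a stream draws. It assumes the decompressed bytes are already available (in PyMuPDF, `pdf_document.xref_stream(3)` should return them for object 3). `classify_form_content` and `PATH_PAINT_OPS` are hypothetical names, and the whitespace token scan is only a heuristic: binary inline-image data can produce false hits.

```python
# PDF operators that paint a vector path (stroke and fill variants).
PATH_PAINT_OPS = {b"S", b"s", b"f", b"F", b"f*", b"B", b"B*", b"b", b"b*"}


def classify_form_content(data: bytes) -> dict:
    """Heuristically summarise a decompressed Form XObject content stream."""
    tokens = data.split()  # split on ASCII whitespace
    return {
        # BI ... ID ... EI wraps an inline image embedded in the stream
        "inline_image": b"BI" in tokens and b"ID" in tokens,
        # any path-painting operator means vector graphics are drawn
        "vector_paths": any(tok in PATH_PAINT_OPS for tok in tokens),
        # Do invokes another XObject (image or form) by name
        "nested_xobject": b"Do" in tokens,
        # BT opens a text object
        "text": b"BT" in tokens,
    }


# Tiny hand-written streams, for illustration only.
vector_sample = b"q 1 0 0 RG 0 0 10 10 re f Q"
inline_sample = b"q BI /W 1 /H 1 /BPC 8 /CS /G ID \xff EI Q"

print(classify_form_content(vector_sample))
print(classify_form_content(inline_sample))
```

If the result reports vector paths or an inline image, there is no separate image object to pull out of the file, and the picture can only be obtained by rendering it.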
```python
import io

import fitz  # PyMuPDF
from PIL import Image

pdf_path = "1.pdf"
pdf_document = fitz.open(pdf_path)
for page_num in range(pdf_document.page_count):
    page = pdf_document[page_num]
    # Extract the text of each page
    text = page.get_text("text")
    print(f"Text on page {page_num + 1}: {text}")
    # Extract the image XObjects of each page
    images = page.get_images(full=True)
    for img_index, img_info in enumerate(images):
        xref = img_info[0]
        base_image = pdf_document.extract_image(xref)
        image_bytes = base_image["image"]
        image = Image.open(io.BytesIO(image_bytes))
        # JPEG cannot store palette or alpha modes, so convert to RGB first
        if image.mode not in ("RGB", "L"):
            image = image.convert("RGB")
        image.save(f"page_{page_num}_image_{img_index}.jpg")
pdf_document.close()
```