I’m using PyMuPDF to process a PDF and then re-save it, but the resulting file loses the original page orientations and crop boxes. Some pages in the original PDF are larger or differently oriented (e.g., rotated or with custom crop regions), but after calling pdf.save(), all pages become uniformly sized and oriented.
Example:
import pymupdf
pdf = pymupdf.open(pdf_path, filetype="pdf")
pdf.save("pymupdf-exported.pdf")
Original File: https://static.vitra.com/media/asset/8664580/storage/master/download/Factbook%2520Electrification%25202024-EN.pdf
Exported PDF: https://drive.google.com/file/d/1mVzAoS8OWHRyM2X_BDABoCCaxAAnrL1x/view?usp=sharing
How can I preserve the original page orientations and crop boxes when using PyMuPDF, so that the re-saved PDF matches the original layout?
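To narrow down which values actually differ between the two files, a minimal diagnostic sketch along these lines might help. It assumes that `page.rotation`, `page.mediabox` and `page.cropbox` are the relevant properties to compare; `PageGeometry`, `read_geometry` and `geometry_differences` are hypothetical helper names introduced only for this illustration.

```python
from typing import NamedTuple


class PageGeometry(NamedTuple):
    """Rotation and page boxes of one page (hypothetical helper type)."""
    rotation: int
    mediabox: tuple[float, ...]
    cropbox: tuple[float, ...]


def read_geometry(pdf_path: str) -> list[PageGeometry]:
    """Collect rotation, MediaBox and CropBox for every page of a PDF."""
    # Imported lazily so the comparison helper below also works without PyMuPDF.
    import pymupdf

    pdf = pymupdf.open(pdf_path)
    try:
        return [
            PageGeometry(page.rotation, tuple(page.mediabox), tuple(page.cropbox))
            for page in pdf
        ]
    finally:
        pdf.close()


def geometry_differences(original, exported, tol=1e-3):
    """Return (page_index, field, original_value, exported_value) per mismatch."""
    diffs = []
    for index, (a, b) in enumerate(zip(original, exported)):
        if a.rotation != b.rotation:
            diffs.append((index, "rotation", a.rotation, b.rotation))
        for field in ("mediabox", "cropbox"):
            va, vb = getattr(a, field), getattr(b, field)
            # Compare box coordinates with a small tolerance for float noise.
            if any(abs(x - y) > tol for x, y in zip(va, vb)):
                diffs.append((index, field, va, vb))
    return diffs
```

Calling `geometry_differences(read_geometry(pdf_path), read_geometry("pymupdf-exported.pdf"))` would then list the pages whose rotation or boxes changed after the re-save; an empty list would suggest the geometry survived and the difference lies elsewhere, for example in how a viewer displays the file.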
My end goal:
import cv2
import numpy as np
import pymupdf
from PIL import Image

def convert_pdf_to_image_arrays(pdf_path: str, zoom: int, dpi: int) -> list[np.ndarray]:
    """
    Convert a PDF to high-resolution image arrays, preserving color fidelity.
    :param pdf_path: Path to the PDF file.
    :param zoom: Currently unused; the resolution is controlled by dpi.
    :param dpi: DPI (dots per inch) for rendering high-resolution images.
    :return: List of NumPy arrays (BGR) representing images of the PDF pages.
    """
    pdf = pymupdf.open(pdf_path, filetype="pdf")
    images: list[np.ndarray] = []  # one BGR array per page
    for page in pdf:
        # Render the page to a pixmap with the desired DPI
        pix = page.get_pixmap(dpi=dpi)
        # Convert the raw pixel data to a PIL image (preserving color accuracy)
        img_pil = Image.frombytes(
            mode="RGB" if pix.n == 3 else "RGBA",
            size=(pix.width, pix.height),
            data=pix.samples,
        )
        # Convert the PIL image to a NumPy array
        img_array = np.array(img_pil)
        # Convert to OpenCV's BGR channel order, dropping the alpha channel if present
        if pix.n == 4:
            img_array = cv2.cvtColor(img_array, cv2.COLOR_RGBA2BGR)
        else:
            img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR)
        images.append(img_array)
    pdf.close()
    return images