As part of my program, I'm trying to use the pdfminer third-party library in Python to open and read PDF pages, and then use regular expressions to search them for specific patterns. I'm also using multiprocessing to parallelize this, because I have a large number of PDFs to analyze; each process should handle a single PDF.
I have this code to set up multiprocessing:
<code>import concurrent.futures
import multiprocessing
from pathlib import Path
from typing import List, Optional, Set, Tuple

from tqdm import tqdm

def process_theme_files(theme_dir: Path,
                        processed_files: Optional[Set[str]] = None,
                        theme_processed_files: Optional[Set[str]] = None
                        ) -> Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]:
    """
    Processes files of a specific theme in a multiprocessed manner.

    Parameters:
    - theme_dir (Path): Path object pointing to the theme directory.
    - processed_files (set, optional): Set of globally processed files. Defaults to None.
    - theme_processed_files (set, optional): Set of files processed in the current theme. Defaults to None.

    Returns:
    - Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]: A tuple containing
      a list of tuples with extracted information from the theme files and a list of exceptions
      encountered during processing.
    """
    results = []
    exceptions = []
    # Number of processes to be used (can be adjusted as needed)
    num_processes = multiprocessing.cpu_count()
    # Initialize processed_files and theme_processed_files as empty sets if not provided
    if processed_files is None:
        processed_files = set()
    if theme_processed_files is None:
        theme_processed_files = set()
    # Get PDF files in the theme directory
    pdf_files = list(theme_dir.glob('**/*.pdf'))
    # Create progress bar
    with tqdm(total=len(pdf_files), desc='Processing PDFs', unit='file') as pbar:
        # Process PDF files in parallel using ProcessPoolExecutor
        with concurrent.futures.ProcessPoolExecutor(max_workers=num_processes) as executor:
            # Map the process_file function to each PDF file in the list
            future_to_file = {executor.submit(process_file, pdf_file): pdf_file for pdf_file in pdf_files}
            # Iterate over results as they become available
            for future in concurrent.futures.as_completed(future_to_file):
                pdf_file = future_to_file[future]
                try:
                    # Get the result of the task
                    file_results, file_exceptions = future.result()
                    # Extend the results list
                    results.extend(file_results)
                    # Append specific exceptions to the exceptions list
                    exceptions.extend(file_exceptions)
                except FileNotFoundError as fnfe:
                    exceptions.append(f"File not found: {fnfe.filename}")
                except Exception as e:
                    # Capture and log the generic exception
                    exceptions.append(f"Error processing file '{pdf_file}': {e}")
                # Update the progress bar
                pbar.update(1)
    return results, exceptions
</code>
And this code for processing individual files:
<code>import os
from io import BytesIO
from pathlib import Path

from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage

MAX_PDF_SIZE_MB = 100  # Maximum allowed PDF size, in MB

def process_file(file_path: Path):
    """
    Process a PDF file to extract text and information.

    Args:
    - file_path (Path): Path object representing the location of the PDF file.

    Returns:
    - Tuple[List, List]: A tuple containing two lists:
        1. List of extracted results.
        2. List of encountered exceptions during processing.

    Raises:
    - FileNotFoundError: If the specified file_path does not exist.
    - Exception: For any other unexpected errors during processing.
    """
    results = []     # List to store extracted information from each page
    exceptions = []  # List to store exceptions encountered during processing
    try:
        # Check the size of the PDF file
        pdf_size_bytes = os.path.getsize(file_path)
        pdf_size_mb = pdf_size_bytes / (1024 * 1024)
        # Check if the PDF file size exceeds the maximum allowed size
        if pdf_size_mb > MAX_PDF_SIZE_MB:
            exceptions.append((file_path, f"The file exceeds the maximum allowed size of {MAX_PDF_SIZE_MB} MB."))
            print(f'{file_path} - Exceeds maximum allowed size of {MAX_PDF_SIZE_MB} MB.')
            return results, exceptions
        # Open the PDF file and read its content into a BytesIO buffer
        with file_path.open('rb') as file:
            pdf_data_buffer = BytesIO(file.read())
            # Iterate through each page of the PDF
            for page_number, page in enumerate(PDFPage.get_pages(pdf_data_buffer, check_extractable=True)):
                # Extract text from the current page
                page_text = extract_text(pdf_data_buffer, page_numbers=[page_number])
                # Process the extracted text to extract information
                page_results, page_exceptions = extract_information(page_text, file_path.name, page_number + 1)  # Page numbers are 1-based
                # Extend results and exceptions lists with page-specific results and exceptions
                results.extend(page_results)
                exceptions.extend(page_exceptions)
    except FileNotFoundError as e:
        # Handle case where the file does not exist
        exceptions.append(e)
        print(f"FileNotFoundError: {e}")
        raise
    except Exception as e:
        # Handle any other unexpected exceptions
        exceptions.append(e)
        print(f"Exception: {e}")
        raise
    return results, exceptions
</code>
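For completeness, `extract_information` (not shown above) scans a page's text with my regexes and returns per-page matches plus any exceptions. The pattern and tuple layout below are just illustrative stand-ins, not my real ones, but the return shape matches what `process_file` expects:

```python
import re

# Hypothetical pattern; my real regexes are more involved
PATTERN = re.compile(r"\b\d{4}-\d{2}\b")

def extract_information(page_text, file_name, page_number):
    """Scan one page's text and return (matches, exceptions)."""
    matches = [(file_name, page_number, m.group(0))
               for m in PATTERN.finditer(page_text)]
    return matches, []

rows, errs = extract_information("Report 2023-07, updated 2023-08", "a.pdf", 1)
# rows → [("a.pdf", 1, "2023-07"), ("a.pdf", 1, "2023-08")]
```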
def process_file(file_path: Path):
"""
Process a PDF file to extract text and information.
Args:
- file_path (Path): Path object representing the location of the PDF file.
Returns:
- Tuple[List, List]: A tuple containing two lists:
1. List of extracted results.
2. List of encountered exceptions during processing.
Raises:
- FileNotFoundError: If the specified file_path does not exist.
- Exception: For any other unexpected errors during processing.
"""
results = [] # List to store extracted information from each page
exceptions = [] # List to store exceptions encountered during processing
try:
# Check the size of the PDF file
pdf_size_bytes = os.path.getsize(file_path)
pdf_size_mb = pdf_size_bytes / (1024 * 1024)
# Check if the PDF file size exceeds the maximum allowed size
if pdf_size_mb > MAX_PDF_SIZE_MB:
exceptions.append((file_path, f"The file exceeds the maximum allowed size of {MAX_PDF_SIZE_MB} MB."))
print(f'{file_path} - Exceeds maximum allowed size of {MAX_PDF_SIZE_MB} MB.')
return results, exceptions
# Open the PDF file and read its content into a BytesIO buffer
with file_path.open('rb') as file:
pdf_data_buffer = BytesIO(file.read())
# Iterate through each page of the PDF
for page_number, page in enumerate(PDFPage.get_pages(pdf_data_buffer, check_extractable=True)):
# Extract text from the current page
page_text = extract_text(pdf_data_buffer, page_numbers=[page_number])
# Process the extracted text to extract information
page_results, page_exceptions = extract_information(page_text, file_path.name, page_number + 1) # Page numbers are 1-based
# Extend results and exceptions lists with page-specific results and exceptions
results.extend(page_results)
exceptions.extend(page_exceptions)
except FileNotFoundError as e:[](https://i.sstatic.net/2jJln4M6.png)
# Handle case where the file does not exist
exceptions.append(e)
print(f"FileNotFoundError: {e}")
raise
except Exception as e:
# Handle any other unexpected exceptions
exceptions.append(e)
print(f"Exception: {e}")
raise
return results, exceptions
The problem is that I run out of RAM, even with 32 GB installed.
Through my research, I learned that PDFs cannot be read randomly; they must be read sequentially from the beginning to the end of the file, which is how I implemented it.
Some of my PDFs are around 100 MB in size, never exceeding 200 MB, and some are quite long (1000 pages) with many images. Since I have to read all the pages when I process a PDF, the only workaround I could find was to limit the size of the PDFs I read to less than 100 MB. I also can’t think of a way to restrict the page count – because to determine the number of pages, I need to open and read the file.
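The only other mitigation I've come up with is to cap how many files are in flight at once, so that only a handful of buffers and result lists exist in memory at any moment. A minimal sketch of that batching idea (with a dummy worker and threads instead of processes, just to keep the sketch short — the same pattern would apply to my `ProcessPoolExecutor`):

```python
import concurrent.futures
from itertools import islice

def process_file(path):
    # Dummy stand-in for the real PDF worker
    return [f"result for {path}"], []

def process_in_batches(paths, batch_size=3):
    """Submit at most batch_size tasks at a time, so only that many
    files' results are pending in memory at once."""
    results, exceptions = [], []
    it = iter(paths)
    with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            futures = {executor.submit(process_file, p): p for p in batch}
            for future in concurrent.futures.as_completed(futures):
                file_results, file_exceptions = future.result()
                results.extend(file_results)
                exceptions.extend(file_exceptions)
    return results, exceptions

all_results, all_errors = process_in_batches([f"doc{i}.pdf" for i in range(10)])
# len(all_results) == 10, all_errors == []
```

This bounds peak memory but I'm not sure it's the right fix, since a single large PDF can still blow up one worker on its own.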
How can I limit RAM usage in this program?