Python PDF searcher runs out of RAM

As part of my program, I’m trying to use the pdfminer third-party library in Python to open and read the PDF pages, and then use regular expressions to search for specific patterns. I’m also using multiprocessing to parallelize this, because I have a large number of PDFs to analyze. Each process should be handling a single PDF.

I have this code to set up multiprocessing:

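import concurrent.futures
import multiprocessing
from pathlib import Path
from typing import List, Optional, Set, Tuple

from tqdm import tqdm


def process_theme_files(theme_dir: Path, processed_files: Optional[Set[str]] = None, theme_processed_files: Optional[Set[str]] = None) -> Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]: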
    """
    Processes files of a specific theme in a multiprocessed manner.

    Parameters:
    - theme_dir (Path): Path object pointing to the theme directory.
    - processed_files (set, optional): Set of globally processed files. Defaults to None.
    - theme_processed_files (set, optional): Set of files processed in the current theme. Defaults to None.

    Returns:
    - Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]: A tuple containing a list of tuples
      containing extracted information from the theme files and a list of exceptions encountered during the processing.
    """
    results = []
    exceptions = []
    
    # Number of processes to be used (can be adjusted as needed)
    num_processes = multiprocessing.cpu_count()

    # Initialize processed_files and theme_processed_files as empty sets if not provided
    if processed_files is None:
        processed_files = set()
    if theme_processed_files is None:
        theme_processed_files = set()

    # Get PDF files in the theme directory
    pdf_files = list(theme_dir.glob('**/*.pdf'))

    # Create progress bar
    with tqdm(total=len(pdf_files), desc='Processing PDFs', unit='file') as pbar:
        # Process PDF files in parallel using ProcessPoolExecutor
        with concurrent.futures.ProcessPoolExecutor(max_workers=num_processes) as executor:
            # Map the process_file function to each PDF file in the list
            future_to_file = {executor.submit(process_file, pdf_file): pdf_file for pdf_file in pdf_files}
            
            # Iterate over results as they become available
            for future in concurrent.futures.as_completed(future_to_file):
                pdf_file = future_to_file[future]
                try:
                    # Get the result of the task
                    file_results, file_exceptions = future.result()
                    # Extend the results list
                    results.extend(file_results)
                    # Append specific exceptions to the exceptions list
                    exceptions.extend(file_exceptions)
                except FileNotFoundError as fnfe:
                    exceptions.append(f"File not found: {fnfe.filename}")
                except Exception as e:
                    # Capture and log the generic exception
                    exceptions.append(f"Error processing file '{pdf_file}': {e}")
                
                # Update the progress bar
                pbar.update(1)

    return results, exceptions
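
For reference, the function above submits every PDF up front and keeps all futures (and their results) in `future_to_file` until it returns. A minimal sketch of capping the number of in-flight tasks instead is shown here; `run_bounded` and `max_in_flight` are hypothetical names, not part of my program, and this only bounds what the parent process retains. The memory used inside each worker is still governed by `max_workers` and by what `process_file` does.

```python
import concurrent.futures
from typing import Any, Callable, Iterable, Iterator, Tuple


def run_bounded(
    executor: concurrent.futures.Executor,
    fn: Callable[[Any], Any],
    items: Iterable[Any],
    max_in_flight: int,
) -> Iterator[Tuple[Any, concurrent.futures.Future]]:
    """Yield (item, finished_future) pairs with at most max_in_flight tasks pending."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")

    pending = {}  # maps each unfinished future to the item it was submitted for
    iterator = iter(items)
    exhausted = False

    while True:
        # Top up the window until it is full or there is nothing left to submit.
        while not exhausted and len(pending) < max_in_flight:
            try:
                item = next(iterator)
            except StopIteration:
                exhausted = True
                break
            pending[executor.submit(fn, item)] = item

        if not pending:
            return

        # Block until at least one task finishes, then hand it to the caller.
        done, _ = concurrent.futures.wait(
            list(pending), return_when=concurrent.futures.FIRST_COMPLETED
        )
        for future in done:
            # Popping drops our reference, so the result can be freed once consumed.
            yield pending.pop(future), future
```

Inside `process_theme_files`, the loop would then look like `for pdf_file, future in run_bounded(executor, process_file, pdf_files, num_processes):`, with the same `try`/`except` around `future.result()`.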

And this code for processing individual files:

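import os
from io import BytesIO
from pathlib import Path

from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage

MAX_PDF_SIZE_MB = 100  # the size limit described in the question text


def process_file(file_path: Path):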
    """
    Process a PDF file to extract text and information.

    Args:
    - file_path (Path): Path object representing the location of the PDF file.

    Returns:
    - Tuple[List, List]: A tuple containing two lists:
        1. List of extracted results.
        2. List of encountered exceptions during processing.

    Raises:
    - FileNotFoundError: If the specified file_path does not exist.
    - Exception: For any other unexpected errors during processing.
    """
    results = []        # List to store extracted information from each page
    exceptions = []     # List to store exceptions encountered during processing
    
    try:
        # Check the size of the PDF file
        pdf_size_bytes = os.path.getsize(file_path)
        pdf_size_mb = pdf_size_bytes / (1024 * 1024)
        
        # Check if the PDF file size exceeds the maximum allowed size
        if pdf_size_mb > MAX_PDF_SIZE_MB:
            exceptions.append((file_path, f"The file exceeds the maximum allowed size of {MAX_PDF_SIZE_MB} MB."))
            print(f'{file_path} - Exceeds maximum allowed size of {MAX_PDF_SIZE_MB} MB.')
            return results, exceptions

        # Open the PDF file and read its content into a BytesIO buffer
        with file_path.open('rb') as file:
            pdf_data_buffer = BytesIO(file.read())
            
            # Iterate through each page of the PDF
            for page_number, page in enumerate(PDFPage.get_pages(pdf_data_buffer, check_extractable=True)):
                # Extract text from the current page
                page_text = extract_text(pdf_data_buffer, page_numbers=[page_number])
                
                # Process the extracted text to extract information
                page_results, page_exceptions = extract_information(page_text, file_path.name, page_number + 1)  # Page numbers are 1-based
                
                # Extend results and exceptions lists with page-specific results and exceptions
                results.extend(page_results)
                exceptions.extend(page_exceptions)
    
    except FileNotFoundError as e:
        # Handle case where the file does not exist
        exceptions.append(e)
        print(f"FileNotFoundError: {e}")
        raise
    
    except Exception as e:
        # Handle any other unexpected exceptions
        exceptions.append(e)
        print(f"Exception: {e}")
        raise
    
    return results, exceptions
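
`extract_information` is not shown above. As a rough sketch of the kind of per-page regex search it performs, assuming two hypothetical patterns and a simplified 4-field result tuple (my real tuples have 8 fields), it looks something like this:

```python
import re
from typing import List, Tuple

# Hypothetical patterns, for illustration only; the real ones are different.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def extract_information_sketch(
    page_text: str, file_name: str, page_number: int
) -> Tuple[List[Tuple[str, int, str, str]], List[Exception]]:
    """Return (results, exceptions) for one page, mirroring the call in process_file."""
    results: List[Tuple[str, int, str, str]] = []
    exceptions: List[Exception] = []

    # Pages without extractable text yield nothing.
    if not page_text:
        return results, exceptions

    # Patterns are compiled once at import time and reused for every page.
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(page_text):
            results.append((file_name, page_number, label, match.group(0)))

    return results, exceptions
```

Only the matched strings are kept, so the page text itself can be discarded as soon as the function returns.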

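The problem is that I run out of RAM, even with 32 GB installed:

[![enter image description here](https://i.sstatic.net/2jJln4M6.png)](https://i.sstatic.net/2jJln4M6.png)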

Through my research, I learned that PDFs cannot be read randomly; they must be read sequentially from the beginning to the end of the file, which is how I implemented it.

Some of my PDFs are around 100 MB in size, never exceeding 200 MB, and some are quite long (1000 pages) with many images. Since I have to read all the pages when I process a PDF, the only workaround I could find was to limit the size of the PDFs I read to less than 100 MB. I also can’t think of a way to restrict the page count – because to determine the number of pages, I need to open and read the file.
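
File sizes, unlike page counts, are available without opening the file, so one option I am considering is to derive the worker count from a memory budget rather than from `cpu_count()`. The sketch below assumes the worst case where the largest files are all processed at the same time. `pick_worker_count` and `expansion_factor` (how many bytes of RAM a worker needs per byte of PDF) are hypothetical; the factor would have to be measured, and I do not know its real value for pdfminer.

```python
import os
from typing import Iterable, Optional


def pick_worker_count(
    file_sizes_bytes: Iterable[int],
    ram_budget_bytes: int,
    expansion_factor: float,
    cpu_count: Optional[int] = None,
) -> int:
    """Largest worker count whose worst-case memory estimate fits in the budget."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1

    # Worst case: N workers are busy with the N largest files simultaneously.
    largest_first = sorted(file_sizes_bytes, reverse=True)

    workers = 0
    estimated_use = 0.0
    for size in largest_first[:cpu_count]:
        estimated_use += size * expansion_factor
        if estimated_use > ram_budget_bytes:
            break
        workers += 1

    # Always allow at least one worker so processing can proceed.
    return max(1, workers)
```

The sizes could be collected with `pdf_file.stat().st_size` for each path in `pdf_files`, and the result passed as `max_workers` to `ProcessPoolExecutor`.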

How can I limit RAM usage in this program?
