As part of my program, I'm trying to use the pdfminer third-party library in Python to open and read PDF pages, and then use regular expressions to search them for specific patterns. I'm also using multiprocessing to parallelize this, because I have a large number of PDFs to analyze; each process should handle a single PDF.
I have this code to set up multiprocessing:
<code>import concurrent.futures
import multiprocessing
from pathlib import Path
from typing import List, Optional, Set, Tuple

from tqdm import tqdm

def process_theme_files(theme_dir: Path,
                        processed_files: Optional[Set[str]] = None,
                        theme_processed_files: Optional[Set[str]] = None
                        ) -> Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]:
    """
    Processes files of a specific theme in a multiprocessed manner.

    Parameters:
    - theme_dir (Path): Path object pointing to the theme directory.
    - processed_files (set, optional): Set of globally processed files. Defaults to None.
    - theme_processed_files (set, optional): Set of files processed in the current theme. Defaults to None.

    Returns:
    - Tuple[List[Tuple[str, int, str, str, str, str, str, str]], List[Exception]]: A tuple containing
      a list of tuples with extracted information from the theme files and a list of exceptions
      encountered during processing.
    """
    results = []
    exceptions = []
    # Number of processes to be used (can be adjusted as needed)
    num_processes = multiprocessing.cpu_count()
    # Initialize processed_files and theme_processed_files as empty sets if not provided
    if processed_files is None:
        processed_files = set()
    if theme_processed_files is None:
        theme_processed_files = set()
    # Get PDF files in the theme directory
    pdf_files = list(theme_dir.glob('**/*.pdf'))
    # Create progress bar
    with tqdm(total=len(pdf_files), desc='Processing PDFs', unit='file') as pbar:
        # Process PDF files in parallel using ProcessPoolExecutor
        with concurrent.futures.ProcessPoolExecutor(max_workers=num_processes) as executor:
            # Map the process_file function to each PDF file in the list
            future_to_file = {executor.submit(process_file, pdf_file): pdf_file for pdf_file in pdf_files}
            # Iterate over results as they become available
            for future in concurrent.futures.as_completed(future_to_file):
                pdf_file = future_to_file[future]
                try:
                    # Get the result of the task
                    file_results, file_exceptions = future.result()
                    # Extend the results list
                    results.extend(file_results)
                    # Append specific exceptions to the exceptions list
                    exceptions.extend(file_exceptions)
                except FileNotFoundError as fnfe:
                    exceptions.append(f"File not found: {fnfe.filename}")
                except Exception as e:
                    # Capture and log the generic exception
                    exceptions.append(f"Error processing file '{pdf_file}': {e}")
                # Update the progress bar
                pbar.update(1)
    return results, exceptions
</code>
And this code for processing individual files:
<code>import os
from io import BytesIO
from pathlib import Path

from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage

MAX_PDF_SIZE_MB = 100  # Maximum allowed PDF size, in MB

def process_file(file_path: Path):
    """
    Process a PDF file to extract text and information.

    Args:
    - file_path (Path): Path object representing the location of the PDF file.

    Returns:
    - Tuple[List, List]: A tuple containing two lists:
        1. List of extracted results.
        2. List of encountered exceptions during processing.

    Raises:
    - FileNotFoundError: If the specified file_path does not exist.
    - Exception: For any other unexpected errors during processing.
    """
    results = []     # List to store extracted information from each page
    exceptions = []  # List to store exceptions encountered during processing
    try:
        # Check the size of the PDF file
        pdf_size_bytes = os.path.getsize(file_path)
        pdf_size_mb = pdf_size_bytes / (1024 * 1024)
        # Check if the PDF file size exceeds the maximum allowed size
        if pdf_size_mb > MAX_PDF_SIZE_MB:
            exceptions.append((file_path, f"The file exceeds the maximum allowed size of {MAX_PDF_SIZE_MB} MB."))
            print(f'{file_path} - Exceeds maximum allowed size of {MAX_PDF_SIZE_MB} MB.')
            return results, exceptions
        # Open the PDF file and read its content into a BytesIO buffer
        with file_path.open('rb') as file:
            pdf_data_buffer = BytesIO(file.read())
            # Iterate through each page of the PDF
            for page_number, page in enumerate(PDFPage.get_pages(pdf_data_buffer, check_extractable=True)):
                # Extract text from the current page
                page_text = extract_text(pdf_data_buffer, page_numbers=[page_number])
                # Process the extracted text to extract information
                page_results, page_exceptions = extract_information(page_text, file_path.name, page_number + 1)  # Page numbers are 1-based
                # Extend results and exceptions lists with page-specific results and exceptions
                results.extend(page_results)
                exceptions.extend(page_exceptions)
    except FileNotFoundError as e:
        # Handle case where the file does not exist
        exceptions.append(e)
        print(f"FileNotFoundError: {e}")
        raise
    except Exception as e:
        # Handle any other unexpected exceptions
        exceptions.append(e)
        print(f"Exception: {e}")
        raise
    return results, exceptions
</code>
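For completeness, `extract_information` (not shown above) scans a page's text with my regexes and returns per-page matches plus any exceptions. The pattern and tuple layout below are just illustrative stand-ins, not my real ones, but the return shape matches what `process_file` expects:

```python
import re

# Hypothetical pattern; my real regexes are more involved
PATTERN = re.compile(r"\b\d{4}-\d{2}\b")

def extract_information(page_text, file_name, page_number):
    """Scan one page's text and return (matches, exceptions)."""
    matches = [(file_name, page_number, m.group(0))
               for m in PATTERN.finditer(page_text)]
    return matches, []

rows, errs = extract_information("Report 2023-07, updated 2023-08", "a.pdf", 1)
# rows → [("a.pdf", 1, "2023-07"), ("a.pdf", 1, "2023-08")]
```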
def process_file(file_path: Path):
"""
Process a PDF file to extract text and information.
Args:
- file_path (Path): Path object representing the location of the PDF file.
Returns:
- Tuple[List, List]: A tuple containing two lists:
1. List of extracted results.
2. List of encountered exceptions during processing.
Raises:
- FileNotFoundError: If the specified file_path does not exist.
- Exception: For any other unexpected errors during processing.
"""
results = [] # List to store extracted information from each page
exceptions = [] # List to store exceptions encountered during processing
try:
# Check the size of the PDF file
pdf_size_bytes = os.path.getsize(file_path)
pdf_size_mb = pdf_size_bytes / (1024 * 1024)
# Check if the PDF file size exceeds the maximum allowed size
if pdf_size_mb > MAX_PDF_SIZE_MB:
exceptions.append((file_path, f"The file exceeds the maximum allowed size of {MAX_PDF_SIZE_MB} MB."))
print(f'{file_path} - Exceeds maximum allowed size of {MAX_PDF_SIZE_MB} MB.')
return results, exceptions
# Open the PDF file and read its content into a BytesIO buffer
with file_path.open('rb') as file:
pdf_data_buffer = BytesIO(file.read())
# Iterate through each page of the PDF
for page_number, page in enumerate(PDFPage.get_pages(pdf_data_buffer, check_extractable=True)):
# Extract text from the current page
page_text = extract_text(pdf_data_buffer, page_numbers=[page_number])
# Process the extracted text to extract information
page_results, page_exceptions = extract_information(page_text, file_path.name, page_number + 1) # Page numbers are 1-based
# Extend results and exceptions lists with page-specific results and exceptions
results.extend(page_results)
exceptions.extend(page_exceptions)
except FileNotFoundError as e:[](https://i.sstatic.net/2jJln4M6.png)
# Handle case where the file does not exist
exceptions.append(e)
print(f"FileNotFoundError: {e}")
raise
except Exception as e:
# Handle any other unexpected exceptions
exceptions.append(e)
print(f"Exception: {e}")
raise
return results, exceptions
The problem is that I run out of RAM, even with 32 GB installed.
Through my research, I learned that PDFs cannot be read randomly; they must be read sequentially from the beginning to the end of the file, which is how I implemented it.
Some of my PDFs are around 100 MB in size, never exceeding 200 MB, and some are quite long (1000 pages) with many images. Since I have to read all the pages when I process a PDF, the only workaround I could find was to limit the size of the PDFs I read to less than 100 MB. I also can’t think of a way to restrict the page count – because to determine the number of pages, I need to open and read the file.
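The only other mitigation I've come up with is to cap how many files are in flight at once, so that only a handful of buffers and result lists exist in memory at any moment. A minimal sketch of that batching idea (with a dummy worker and threads instead of processes, just to keep the sketch short — the same pattern would apply to my `ProcessPoolExecutor`):

```python
import concurrent.futures
from itertools import islice

def process_file(path):
    # Dummy stand-in for the real PDF worker
    return [f"result for {path}"], []

def process_in_batches(paths, batch_size=3):
    """Submit at most batch_size tasks at a time, so only that many
    files' results are pending in memory at once."""
    results, exceptions = [], []
    it = iter(paths)
    with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            futures = {executor.submit(process_file, p): p for p in batch}
            for future in concurrent.futures.as_completed(futures):
                file_results, file_exceptions = future.result()
                results.extend(file_results)
                exceptions.extend(file_exceptions)
    return results, exceptions

all_results, all_errors = process_in_batches([f"doc{i}.pdf" for i in range(10)])
# len(all_results) == 10, all_errors == []
```

This bounds peak memory but I'm not sure it's the right fix, since a single large PDF can still blow up one worker on its own.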
How can I limit RAM usage in this program?