This is the code:
```python
import re
from io import BytesIO

import PyPDF2

for file in files.get('files', []):
    # … (Get file content as before)
    # Extract data from the PDF
    pdf_reader = PyPDF2.PdfReader(BytesIO(file_content))
    page = pdf_reader.pages[0]  # Assuming you want to extract from the first page

    # 1. File Name
    file_name = file['name']
    print(f"File: {file_name}")

    # 2. Process Number
    process_number = None
    process_number_match = None
    process_number_match = re.search(r"(\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4})", page.extract_text())
    if process_number_match:
        process_number = process_number_match.group(1)
        print(f"Process Number: {process_number}")
    else:
        print("erro, número do processo não encontrado")  # "error, process number not found"

    # 3. Name
    name = None  # Reset the name variable
    name_match = None
    name_match = re.search(r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z\s]+)", page.extract_text())
    if name_match:
        name = name_match.group(2)
        print(f"Name: {name}")
    else:
        print("error, nome não encontrado")  # "error, name not found"

    # 4. Keywords
    found_keywords = []  # Reset the found_keywords list
    keywords = ["audiência", "subsídios", "cumprimento"]
    for keyword in keywords:
        if keyword in page.extract_text():
            found_keywords.append(keyword)
    if found_keywords:
        print(f"Keywords Found: {', '.join(found_keywords)}")
    else:
        print("erro, pedido não encontrado")  # "error, request not found"
```
It keeps printing this:
Keywords Found: cumprimento
File: 33-00737.015338.pdf
Process Number: (number1)
Name:(name1)
S
Keywords Found: cumprimento
File: 32-00737.012571.pdf
Process Number: (number1)
Name:(name1)
S
Keywords Found: cumprimento
File: 31-00737.012592.pdf
Process Number: (number1)
Name:(name1)
S
Keywords Found: cumprimento
File: 30-00737.010470.pdf
Process Number: (number1)
Name:(name1)
S
Keywords Found: cumprimento
File: 29-00737.007060.pdf
Process Number: (number1)
Name:(name1)
The file name is updated each time, so it is reading the correct files, but the other strings keep repeating. I tried resetting the variables with `= None`, but that didn't work.
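The stray `S` printed after each name looks like it comes from the name pattern itself: a whitespace class inside `[A-Z\s]+` also matches line breaks, so the capture can run onto the first capital letter of the next line. Below is a minimal sketch on made-up text. The sample string and the `name_single_line` variant are assumptions for illustration only and are not taken from the real PDFs.

```python
import re

# Hypothetical sample text; the real PDF text is not shown in this post.
sample = (
    "Processo: 1234567-89.2023.4.01.3400\n"
    "AUTOR: JOAO DA SILVA\n"
    "Segue o pedido de cumprimento."
)

process_pattern = re.compile(r"(\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4})")
name_pattern = re.compile(
    r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z\s]+)"
)
# Assumed variant: only spaces and tabs are allowed inside the name,
# so the match stops at the end of the line.
name_single_line = re.compile(
    r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z \t]+)"
)

process_number = process_pattern.search(sample).group(1)
name_loose = name_pattern.search(sample).group(2)
name_strict = name_single_line.search(sample).group(2).strip()

print(repr(process_number))
print(repr(name_loose))   # includes the newline and the next line's first capital
print(repr(name_strict))
```

This only accounts for the extra `S`; it does not explain why the same name repeats for every file.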
I tried using:

```python
# 3. Name
name = None  # Reset the name variable
name_match = None
name_match = re.search(r"(AUTOR|INTERESSADOS|INTERESSADO|INTERESSADA):\s+([A-Z\s]+)", page.extract_text())
if name_match:
    name = name_match.group(2)
    print(f"Name: {name}")
else:
    print("error, nome não encontrado")  # "error, name not found"
```
I was expecting to print the name for each document. Instead I got the name right for the first document and it got repeated for all the others.
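Since the variables are reassigned on every iteration, one thing worth checking is whether the bytes being parsed actually change from file to file. The sketch below is a possible diagnostic, not a confirmed cause: `fingerprint` is a hypothetical helper, and the two byte strings are stand-ins for whatever the elided download step stores in `file_content`.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the first 12 hex characters of the SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()[:12]


# Stand-ins for the bytes of two different downloads (hypothetical content).
first = b"%PDF-1.4 first document"
second = b"%PDF-1.4 second document"

print(fingerprint(first))
print(fingerprint(second))
print(fingerprint(first) == fingerprint(second))  # False when the contents differ
```

Inside the loop this could be used as `print(file['name'], fingerprint(file_content))`. If the fingerprint is identical for every file name, the same content is being parsed each time and the problem would be in the step that fetches `file_content` rather than in the regex code.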
Victor Brandao is a new contributor to this site.