I have code that loads PDF files and extracts the text of the entire document. I then search that text for several key fields to return specific information. The problem is that the PDFs, despite containing the same information, all use different formats, and the key fields I search for vary between them. For example, I have a function that returns information from a specific section of the text, but in some files it does not find that section even though it is there.
The function below extracts information from section 2. It works on some files and not on others. I have attached example printouts: in the first, the information is retrieved; in the second, it is not found. Can anyone help?
Thanks
import re

def extract_healthclassifications(text):
    # Heading variants that open and close section 2 (hazard identification).
    start_patterns = ["SECÇÃO 2: IDENTIFICAÇÃO DOS PERIGOS", "Secção 2: Identificação dos perigos"]
    end_patterns = ["SECÇÃO 3: COMPOSIÇÃO/INFORMAÇÃO SOBRE OS COMPONENTES", "Secção 3: Composição/informação sobre os componentes"]
    # extract_specific_text is defined elsewhere in my code (not shown here).
    section_text = extract_specific_text(text, start_patterns, end_patterns)
    if not section_text:
        print("Erro: Seção não encontrada.")  # "Error: section not found."
        return ["Informação não encontrada"]  # "Information not found"
    print(f"Texto da seção extraído:\n{section_text}")  # "Extracted section text"
    search_patterns = ["Acute Tox. 1", "Acute Tox. 2", "Acute Tox. 3", "Acute Tox. 4", "Asp. Tox. 1", "Carc. 1A", "Carc. 1B", "Carc. 2", "Eye Dam. 1", "Eye Dam. 2", "Eye Irrit. 1", "Eye Irrit. 2", "Lact.", "Muta. 1B", "Muta. 2", "Repr. 1A", "Repr. 1B", "Repr. 2", "Resp. Sens 1", "Skin Corr.", "Skin Corr. 1", "Skin Corr. 1A", "Skin Corr. 1B", "Skin Corr. 1C", "Skin Irrit. 2", "Skin Sens. 1", "Skin Sens. 1A", "Skin Sens. 1B", "STOT RE 1", "STOT RE 2", "STOT SE", "STOT SE. 1", "STOT SE. 2", "STOT SE. 3"]
    # Escape the literal dots and try longer names first so "Skin Corr. 1A" wins over "Skin Corr.".
    alternatives = "|".join(re.escape(p) for p in sorted(search_patterns, key=len, reverse=True))
    # (?!\w) instead of a trailing \b, because \b never matches after a final "." followed by a space.
    pattern = re.compile(r'(\b(?:' + alternatives + r')(?!\w))(?:.*?\s(H\d{3}[a-zA-Z]?))?', re.IGNORECASE)
    matches = pattern.findall(section_text)
    print(f"Correspondências encontradas:\n{matches}")  # "Matches found"
    # Deduplicate; append the H-code when one was found on the same line.
    health_classifications = list({f"{match[0].strip()} – {match[1].strip()}" if match[1] else match[0].strip() for match in matches})
    return health_classifications if health_classifications else ["Informação não encontrada"]
I need help standardizing the search for sections
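To illustrate what I mean by standardizing, here is a minimal sketch of one possible approach: normalize the text (strip accents, lowercase, collapse whitespace) and then match the headings with a tolerant regular expression instead of exact strings. The names `normalize`, `find_section`, `SECTION_2` and `SECTION_3` are hypothetical and not part of my code above, and the heading patterns are assumptions based only on the two variants I have seen so far.

```python
import re
import unicodedata


def normalize(text):
    """Strip accents, collapse runs of whitespace, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).lower()


# Assumed heading patterns, written against the normalized text.
# "sec?cao" covers both "secção" and "seção" once accents are stripped;
# the separator after the number may be ":", ".", "-", an en dash, or absent.
SECTION_2 = re.compile(r"sec?cao\s*2\s*[:.\-–]?\s*identificacao dos perigos")
SECTION_3 = re.compile(r"sec?cao\s*3\s*[:.\-–]?\s*composicao")


def find_section(text, start_re=SECTION_2, end_re=SECTION_3):
    """Return the normalized text between the two headings, or None."""
    norm = normalize(text)
    start = start_re.search(norm)
    if not start:
        return None
    # Look for the closing heading only after the opening one.
    end = end_re.search(norm, start.end())
    stop = end.start() if end else len(norm)
    return norm[start.end():stop].strip()


sample = "Seção  2 -  Identificação   dos  perigos\nSkin Irrit. 2 H315\nSeção 3 - Composição"
print(find_section(sample))
```

Note that this sketch returns the normalized (lowercased, accent-free) section text, so any later search on it would need to be case-insensitive. It also takes the first matching heading, which could be a table-of-contents entry in some files.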
Nuno Braga is a new contributor to this site.