I wrote a Python application in Streamlit. This specific piece of my code matches tag numbers from an Excel file against the patterns defined for “first_half” and “second_half” tag numbers in the PDF data (a PDF file whose text has been extracted). However, when I run the application, the match dataframe contains PDF data that cannot exist: it finds pattern matches that do not occur in the PDF file, and thus incorrectly pairs them with the Excel tag numbers.
I followed these steps:
- I made sure the text extracted from the PDF did not contain any “hidden information” and only contains what is in the PDF file. This is correct; nothing goes wrong here.
- I enabled debug logging (logging.basicConfig(level=logging.DEBUG)) to check whether the values found for “first_half” and “second_half” did match what is in the PDF. This is correct; nothing goes wrong here.
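The debug check from the second step can be sketched like this (values_found and pdf_text are hypothetical stand-ins for the real variables in my app):

```python
import io
import logging

# Sketch of the debug check above; values_found and pdf_text are
# hypothetical stand-ins for the real variables in my app.
stream = io.StringIO()
logging.basicConfig(level=logging.DEBUG, stream=stream, force=True)

pdf_text = "AA BB CC"
values_found = {"first_half": ["AA", "BB"], "second_half": ["CC"]}

# Log, for every extracted value, whether it literally occurs in the PDF text
for key, values in values_found.items():
    for value in values:
        logging.debug("%s value %r in PDF text: %s", key, value, value in pdf_text)

print(stream.getvalue())
```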
But somehow, in the results dataframe, the columns “first_half” and “second_half” still contain information that is not found in the PDF (only in the Excel file), and so the code reports matches that aren’t there.
For example, the first_half matches found in the PDF are AA, BB and CC, but the Excel file contains AA, BB, CC and DD. Suddenly DD is displayed as a found value for “first_half”, and the code claims this match exists in both the Excel and PDF files.
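To illustrate the kind of behaviour I am seeing (a hypothetical sketch; the real contents of patterns_PDF are built elsewhere in my app): if the pattern describes a generic shape, such as any two capital letters, rather than the literal values extracted from the PDF, re.search will happily match an Excel-only value like DD:

```python
import re

# Hypothetical sketch: if "first_half" is a generic shape instead of the
# literal PDF values, it matches Excel-only values as well.
patterns_PDF = {"first_half": r"[A-Z]{2}"}  # matches ANY two capital letters

excel_part = "DD"  # present only in the Excel file, not in the PDF
m = re.search(patterns_PDF["first_half"], excel_part)
# m.group() is "DD", even though DD never occurs in the PDF

# Restricting the pattern to the values actually extracted avoids this
pdf_values = ["AA", "BB", "CC"]
strict = re.compile("|".join(map(re.escape, pdf_values)))
# strict.search(excel_part) is None
```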
The piece of code which compares those values is as follows:
```python
# Split the unmatched Excel matches into different parts
if not unmatched_excel_matches_df.empty:
    unmatched_excel_matches_df['Excel_First_Half'] = unmatched_excel_matches_df['Match_Excel'].str.split('-', n=1).str[0]
    unmatched_excel_matches_df['Excel_Second_Half'] = unmatched_excel_matches_df['Match_Excel'].str.split('-', n=2).str[2]
    unmatched_excel_matches_df['Excel_First_Part'] = unmatched_excel_matches_df['Match_Excel'].str.split('-').str[1]
    unmatched_excel_matches_df['Excel_Second_Part'] = unmatched_excel_matches_df['Excel_First_Half'] + '-' + unmatched_excel_matches_df['Excel_Second_Half']
    unmatched_excel_matches_df = unmatched_excel_matches_df.drop(columns=['Excel_First_Half', 'Excel_Second_Half'])

    # Iterate over unmatched_excel_matches_df rows
    for index, row in unmatched_excel_matches_df.iterrows():
        # Check against the first half
        first_half_match = re.search(patterns_PDF["first_half"], row["Excel_First_Part"])
        if first_half_match:
            unmatched_excel_matches_df.at[index, "First_Half_Match"] = first_half_match.group()
        else:
            unmatched_excel_matches_df.at[index, "First_Half_Match"] = None

        # Check against the second half
        second_half_match = re.search(patterns_PDF["second_half"], row["Excel_Second_Part"])
        if second_half_match:
            unmatched_excel_matches_df.at[index, "Match_PDF"] = second_half_match.group()
        else:
            unmatched_excel_matches_df.at[index, "Match_PDF"] = None

    # Drop unnecessary columns
    unmatched_excel_fixed_df = unmatched_excel_matches_df.drop(columns=['Excel_First_Part', 'Excel_Second_Part', 'First_Half_Match'])
```
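As an additional sanity check I could filter the result afterwards, keeping a match only if the matched value literally occurs in the extracted PDF text (a sketch; pdf_text stands for the extracted text, and the column name mirrors the one built above):

```python
import pandas as pd

# Sanity-check sketch: pdf_text stands for the extracted PDF text,
# and the column name mirrors the one built in my code above.
pdf_text = "AA BB CC"  # hypothetical extracted text

df = pd.DataFrame({"Match_PDF": ["AA", "DD"]})

# Keep a match only if the matched value literally occurs in the PDF text
mask = df["Match_PDF"].map(lambda v: isinstance(v, str) and v in pdf_text)
df.loc[~mask, "Match_PDF"] = None
# the "DD" row is cleared, because DD never occurs in pdf_text
```

This doesn’t explain the cause, but it would at least stop phantom values from reaching the results dataframe.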
I’ve had this code work for a different set of documents, but now it suddenly behaves like this, and I cannot find out why or where it goes wrong. Can someone please give me an idea as to why this happens?