I am working to digitize these tables using Google’s Form Parser, but I have been struggling to reproduce them accurately. I’ve tried reading the table from an image into a CSV, but the output is still missing values, dropping empty cells, or mixing values from different rows, even after preprocessing the image (increasing contrast, adaptive thresholding, and noise removal). Here is an example of the tables I’m trying to digitize: Interpol crime statistics.
While I only need the “number of offenses known to the police”, “offenders”, “females”, “juveniles”, and “aliens” columns, I’ve been trying to read the entire table so I can use the Form Parser. Any suggestions on how I can increase the accuracy?
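To make the column requirement concrete, here is a minimal sketch of the post-processing step this implies. It is not part of my current pipeline: `select_columns` and `to_csv_text` are hypothetical helper names, and the sketch assumes the first row of the parsed table is a single header row whose cells contain the keywords below. Short rows are padded so that empty cells keep their position instead of shifting values left.

```python
import csv
import io

# Keywords taken from the column names I need (matching is an assumption:
# case-insensitive substring match against the header cell text).
WANTED = ["offenses", "offenders", "females", "juveniles", "aliens"]


def select_columns(table_data, wanted=WANTED):
    """Keep only the columns whose header contains one of the wanted keywords."""
    if not table_data:
        return []
    header = table_data[0]
    width = max(len(row) for row in table_data)
    keep = [
        i for i, title in enumerate(header)
        if any(word in title.lower() for word in wanted)
    ]
    selected = []
    for row in table_data:
        # Pad short rows with empty strings so column positions stay aligned.
        padded = [cell.strip() for cell in row] + [""] * (width - len(row))
        selected.append([padded[i] for i in keep])
    return selected


def to_csv_text(rows):
    """Serialize rows to CSV text, writing empty cells as empty fields."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

This only helps once the parser returns the right cells; it does not fix rows that were already merged or split by the OCR step.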
The function I have been using with Google’s Document AI client:
from google.cloud import documentai


def extract_image(image_path, client, name, api_key):
    # api_key is only used by the commented-out metadata step below.
    print(f"Extracting the table from: {image_path}.")
    with open(image_path, 'rb') as f:
        image_content = f.read()
    raw_document = documentai.RawDocument(content=image_content, mime_type='image/png')
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    try:
        result = client.process_document(request=request)
        print(f"Result:{result}")
        table_data = []
        # Extract additional data (country, year, page)
        # metadata = parse(image_path, api_key)
        # table_data.append(['Start of Table:'])
        # table_data.append([metadata])
        # Extracting table data
        for page in result.document.pages:
            for table in page.tables:
                header_rows = list(table.header_rows)
                # print(f"Header Row: {header_rows}")
                body_rows = list(table.body_rows)
                # print(f"Body Row: {body_rows}")
                for row in header_rows + body_rows:
                    # get_text is a helper that is not shown here.
                    row_data = [get_text(result.document, cell.layout.text_anchor) for cell in row.cells]
                    # print(f"Row Data: {row_data}")
                    table_data.append(row_data)
        return table_data
    except Exception as e:
        print(f"An error occurred while processing {image_path}: {e}")
        return []
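The `get_text` helper called above is not shown. As a minimal sketch of what such a helper typically does (an assumption, not necessarily my exact code): a Document AI text anchor holds `text_segments` with `start_index` and `end_index` offsets into `document.text`, and the helper joins those slices. The `SimpleNamespace` objects below are stand-ins for the real response objects so the example runs without calling the API.

```python
from types import SimpleNamespace


def get_text(document, text_anchor):
    """Join the slices of document.text that a text anchor points to."""
    parts = []
    for segment in text_anchor.text_segments:
        # Indices may arrive as ints or numeric strings; normalize to int.
        start = int(segment.start_index)
        end = int(segment.end_index)
        parts.append(document.text[start:end])
    # An anchor with no segments yields "", which keeps empty cells in the row.
    return "".join(parts).strip()


# Stand-in objects mimicking the shape of a processed document (hypothetical data).
fake_document = SimpleNamespace(text="Murder 10\nTheft 50\n")
fake_anchor = SimpleNamespace(
    text_segments=[SimpleNamespace(start_index=0, end_index=6)]
)
print(get_text(fake_document, fake_anchor))  # -> Murder
```

Returning an empty string for an anchor without segments matters here, because skipping such cells would shift later values into the wrong columns.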
Lillian Yang is a new contributor to this site.