Data extraction from an image inside a PDF file using Python in Colab
- Language: Python
- Cloud environment: Colab
- Input: a PDF file containing an image: https://sites.arizona.edu/njardarson-lab/files/2024/05/Top200BrandNameDrugBySales2023V1.pdf
- Data to be extracted using Python scripts running in Colab
- Extracted data to be saved as a .csv file
- Saved .csv file to be downloaded
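For the last two requirements (saving and downloading the .csv), here is a minimal sketch of what we have in mind. The helper names `save_rows_to_csv` and `download_in_colab` are hypothetical, and the rows and header are assumed to come from whatever extraction step ends up working; `google.colab.files.download` is only available inside a Colab runtime.

```python
import csv
from pathlib import Path


def save_rows_to_csv(rows, header, csv_path):
    """Write a header plus extracted rows to a CSV file and return its path.

    `rows` is assumed to be a list of lists of strings (hypothetical shape).
    """
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return csv_path


def download_in_colab(csv_path):
    """Trigger a browser download of the file; works only inside Colab."""
    # Imported lazily because the module exists only in a Colab runtime.
    from google.colab import files

    files.download(str(csv_path))
```

In a Colab cell this would be called as `download_in_colab(save_rows_to_csv(rows, header, "top200.csv"))`, where the file name is only an example.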
We tried poppler together with Tesseract OCR, but only 60% of the data was extracted correctly.
We need the extraction to be 100% accurate and are looking for an approach that can achieve this.
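For reference, a minimal sketch of a poppler + Tesseract pipeline of this kind, together with one possible way to quantify accuracy. This is not the exact code we ran: it assumes the `pdf2image` and `pytesseract` packages (with the poppler and Tesseract binaries installed in Colab), the DPI and page segmentation mode are illustrative values, and `fetch_pdf`, `ocr_pdf` and `field_accuracy` are hypothetical helper names. The accuracy definition (share of reference cells reproduced exactly) is also an assumption.

```python
import urllib.request

PDF_URL = (
    "https://sites.arizona.edu/njardarson-lab/files/2024/05/"
    "Top200BrandNameDrugBySales2023V1.pdf"
)

# In Colab, the OCR stack would first be installed with something like:
#   !apt-get install -y poppler-utils tesseract-ocr
#   !pip install pdf2image pytesseract


def fetch_pdf(url=PDF_URL, dest="top200.pdf"):
    """Download the PDF to the local Colab disk and return the file name."""
    urllib.request.urlretrieve(url, dest)
    return dest


def ocr_pdf(pdf_path, dpi=300, psm=6):
    """Render each page to an image via poppler, then OCR it with Tesseract.

    dpi and psm are illustrative assumptions, not tuned settings.
    """
    # Imported lazily so field_accuracy works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return [
        pytesseract.image_to_string(page, config=f"--psm {psm}")
        for page in pages
    ]


def field_accuracy(extracted, reference):
    """Fraction of reference cells that the extraction reproduced exactly.

    Both arguments are lists of rows (lists of strings); cells are compared
    position by position after trimming whitespace. Missing cells count as
    wrong because the denominator is the number of reference cells.
    """
    total = sum(len(row) for row in reference)
    if total == 0:
        return 0.0
    correct = 0
    for ext_row, ref_row in zip(extracted, reference):
        for ext_cell, ref_cell in zip(ext_row, ref_row):
            if ext_cell.strip() == ref_cell.strip():
                correct += 1
    return correct / total
```

Comparing the OCR output against a small hand-typed sample with `field_accuracy` would make it possible to check whether a suggested change actually moves the result towards 100%.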