I am encountering an issue when attempting to extract text from a PDF file. Specifically, I receive the following error message:
?Identity-H Unimplemented?
This only happens with certain PDF files. From what I understand, it is related to fonts whose encoding (Identity-H) the extraction process cannot decode.
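To make my current understanding concrete, here is a minimal sketch of what I believe decoding Identity-H text involves. It is not lopdf's actual code and uses only the standard library. As far as I know, Identity-H treats every two bytes as a big-endian character ID (CID), and turning those IDs into readable text requires the font's ToUnicode mapping. The function name `decode_identity_h` and the `to_unicode` table are hypothetical names of my own; in a real PDF the table would have to be parsed from the font's ToUnicode CMap.

```rust
use std::collections::HashMap;

/// Decode the raw bytes of a string drawn with an Identity-H encoded font.
///
/// `to_unicode` is a hypothetical, already-parsed lookup table (CID -> char);
/// a real extractor would build it from the font's ToUnicode CMap stream.
fn decode_identity_h(bytes: &[u8], to_unicode: &HashMap<u16, char>) -> String {
    bytes
        // Identity-H uses two bytes per character code.
        // `chunks_exact` silently ignores a trailing odd byte.
        .chunks_exact(2)
        .map(|pair| {
            // The 2-byte big-endian code is used directly as the CID.
            let cid = u16::from_be_bytes([pair[0], pair[1]]);
            // Without a ToUnicode entry the character cannot be recovered,
            // so fall back to the Unicode replacement character.
            to_unicode.get(&cid).copied().unwrap_or('\u{FFFD}')
        })
        .collect()
}
```

With a table that maps CID `0x0027` to `'D'`, calling `decode_identity_h(&[0x00, 0x27], &table)` returns `"D"`, and any CID missing from the table comes back as the replacement character. My guess is that the extraction step prints the placeholder above instead of performing this kind of lookup.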
use lopdf::Document;
use std::error::Error;

fn extract_text_from_pdf(file_path: &str) -> Result<String, Box<dyn Error>> {
    let doc = Document::load(file_path)?;
    let total_pages = doc.get_pages().len() as u32;
    let mut extracted_text = String::new();
    for page_number in 1..=total_pages {
        let page_text = doc.extract_text(&[page_number])?;
        let cleaned_text = page_text.replace(&format!("Page {}", page_number), "");
        extracted_text.push_str(&cleaned_text);
    }
    Ok(extracted_text)
}

fn main() {
    let file_path = "src/data/temp_storage/DAN.pdf";
    match extract_text_from_pdf(file_path) {
        Ok(text) => {
            println!("Extracted text: {}", text);
        }
        Err(e) => {
            eprintln!("Failed to extract text: {:?}", e);
        }
    }
}
Is there a way to solve this problem, or is there a better way to extract text from PDFs?