I’m working on a project where I need to extract the text content of a structure-tree node (a `PDStructureElement`) in a tagged PDF using PDFBox. Here’s the code I’m currently using:
    pdDocument = PDDocument.load(new File(path));

    public void getLabels() {
        PDDocumentCatalog catalog = pdDocument.getDocumentCatalog();
        PDStructureTreeRoot treeRoot = catalog.getStructureTreeRoot();
        if (treeRoot != null) {
            String indent = "";
            for (Object obj : treeRoot.getKids()) {
                if (obj instanceof PDStructureElement) {
                    addLabels((PDStructureElement) obj, indent);
                }
            }
        }
    }

    void addLabels(PDStructureElement element, String indent) {
        if (element.getStructureType().startsWith("H")) {
            checkHeaderHierarchy(element, 0);
        }
        if (element.getStructureType().equalsIgnoreCase("Figure")) {
            String altText = element.getAlternateDescription();
            labels.add(indent + element.getStructureType() + " - " + altText);
            if (altText == null || altText.isEmpty()) {
                errors.add("Image without alternative text on page: " + getPageNumber(element.getPage()));
            }
        } else {
            labels.add(indent + element.getStructureType() + " - " + element.getTitle());
        }
        for (Object obj : element.getKids()) {
            if (obj instanceof PDStructureElement) {
                addLabels((PDStructureElement) obj, indent + " ");
            }
        }
    }
I’ve attempted to use element.getActualText() to retrieve the text content, but it always returns null regardless of the node’s content. Is there any other method or workaround I can use to achieve this?
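As background, and as far as I understand it (this may be wrong): `getActualText()` only returns the optional `/ActualText` replacement string stored on the element, which most documents never set. The visible text lives in the page content stream, in marked-content sequences that the structure element's kids reference by marked-content ID (MCID). The sketch below models that relationship with plain Java only. `Node`, `textOf`, and the `textByMcid` map are hypothetical stand-ins, not PDFBox classes; in a real document the map would presumably have to be built from the page content, for example with PDFBox's `PDFMarkedContentExtractor`, which I have not verified against my file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a tagged-PDF structure tree; none of these types come from PDFBox.
final class McidTextSketch {

    // Stand-in for a structure element: a type, an optional /ActualText, and kids.
    static final class Node {
        final String type;
        final String actualText;                     // null unless the PDF sets /ActualText
        final List<Object> kids = new ArrayList<>(); // each kid is a Node or an Integer MCID

        Node(String type, String actualText) {
            this.type = type;
            this.actualText = actualText;
        }
    }

    // Resolves the text of a node: prefer /ActualText, otherwise concatenate the
    // text of every MCID and child node in document order.
    static String textOf(Node node, Map<Integer, String> textByMcid) {
        if (node.actualText != null) {
            return node.actualText;
        }
        StringBuilder sb = new StringBuilder();
        for (Object kid : node.kids) {
            if (kid instanceof Node) {
                sb.append(textOf((Node) kid, textByMcid));
            } else if (kid instanceof Integer) {
                // An MCID with no collected text contributes nothing.
                sb.append(textByMcid.getOrDefault((Integer) kid, ""));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed to have been collected from the page content, keyed by MCID.
        Map<Integer, String> textByMcid = new HashMap<>();
        textByMcid.put(0, "Hello ");
        textByMcid.put(1, "world");

        Node heading = new Node("H1", null);
        heading.kids.add(0);                 // direct MCID kid
        Node span = new Node("Span", null);
        span.kids.add(1);
        heading.kids.add(span);              // nested element kid

        System.out.println(heading.type + " - " + textOf(heading, textByMcid));
    }
}
```

If this picture is right, what I am really asking is how to build the equivalent of `textByMcid` per page with PDFBox and match it to each element's kids.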
Thanks in advance for any help or suggestions!