I’m developing a PDF reader/editor in Java, using Swing for the UI and Apache PDFBox for the PDF handling. I want to get the objects on each PDF page together with their state: the text, images, and shapes as objects, and whether a piece of text is highlighted, bold, and so on.
How can I access all of the PDF objects so that I can eventually display them with Swing (images at the right positions, text with the right font family, italics, bold, highlighting, etc.)?
For viewing I’m fine with creating my own components. I just need to extract the raw data.
I’m slowly grasping the details of Apache PDFBox. I can already render pages as images, for example. One thing I couldn’t figure out is how to display the PDF pages properly. There is the method PDFRenderer.renderPageToGraphics(), which takes a page index and a Graphics2D object and renders that page onto the Graphics2D object. That doesn’t seem right to me, because the text is not selectable that way. Getting each page as a BufferedImage doesn’t work for me either: first, the text isn’t selectable there either, and second, rendering every page as a BufferedImage is too heavy on memory.
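To make the memory concern concrete, here is a minimal sketch of one way to keep only a few rendered pages alive at a time. It is an illustration under assumptions, not something PDFBox provides: `PageImageCache` and its methods are hypothetical names, and the `renderer` function is assumed to wrap a call such as `PDFRenderer.renderImage(pageIndex)`, with its `IOException` rethrown as an unchecked exception. It does not address text selection.

```java
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Keeps at most maxPages rendered pages in memory (hypothetical helper, not a PDFBox class). */
final class PageImageCache {
    private final IntFunction<BufferedImage> renderer;
    private final Map<Integer, BufferedImage> cache;

    PageImageCache(int maxPages, IntFunction<BufferedImage> renderer) {
        this.renderer = renderer;
        // accessOrder = true: the eldest entry is always the least recently used page.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, BufferedImage> eldest) {
                return size() > maxPages; // evict once the limit is exceeded
            }
        };
    }

    /** Returns the cached image for a page, rendering it first if it is not cached. */
    BufferedImage get(int pageIndex) {
        BufferedImage image = cache.get(pageIndex);
        if (image == null) {
            image = renderer.apply(pageIndex);
            cache.put(pageIndex, image);
        }
        return image;
    }

    /** Number of page images currently held in memory. */
    int cachedPages() {
        return cache.size();
    }
}
```

A viewer would call `get(pageIndex)` only for the pages currently visible in the scroll pane, so pages scrolled out of view become eligible for eviction.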
There are classes like PDFTextStripper that extract text from a PDF, but I couldn’t tell whether it extracts only the text. What about images, tables, shapes, and other objects that could be in a PDF file?
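On the "bold, italic" part: as far as I understand, a PDF does not store a bold or italic flag per piece of text; the style is implied by the font that draws it. The sketch below is a rough heuristic that guesses the style from a font name such as the one returned by `TextPosition.getFont().getName()` in PDFBox. `FontStyleGuess` and its methods are hypothetical names, and the keyword matching is an assumption that will miss fonts whose names don't follow the usual conventions.

```java
import java.util.Locale;

/** Guesses style flags from a PDF font name (hypothetical helper, not part of PDFBox). */
final class FontStyleGuess {
    private FontStyleGuess() {}

    /** Drops a subset prefix such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String baseName(String pdfFontName) {
        if (pdfFontName == null) {
            return "";
        }
        if (pdfFontName.length() > 7 && pdfFontName.charAt(6) == '+') {
            boolean allUpper = true;
            for (int i = 0; i < 6; i++) {
                char c = pdfFontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    allUpper = false;
                    break;
                }
            }
            if (allUpper) {
                return pdfFontName.substring(7);
            }
        }
        return pdfFontName;
    }

    /** True if the name contains "bold", e.g. "Arial-BoldMT". */
    static boolean isBold(String pdfFontName) {
        return baseName(pdfFontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    /** True if the name contains "italic" or "oblique", e.g. "Helvetica-Oblique". */
    static boolean isItalic(String pdfFontName) {
        String name = baseName(pdfFontName).toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }
}
```

Highlighting is a different matter: it is usually stored as a page annotation, not as a property of the text, so a name-based check like this says nothing about it.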
I saw online that some people use PDFPagePanel, but that is deprecated. I’m using version 3.0.2.