The scenario is to extract each chapter and its contents from a .docx file and convert each chapter to an individual PDF.
Chapter headings are formatted in bold and italics, and we will use that formatting to decide where each chapter's contents begin and end.
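To make the heading rule concrete, here is a minimal sketch of the splitting step on its own, independent of whichever library reads the .docx. The `Para`, `Chapter`, and `ChapterSplitter` types are hypothetical names introduced only for illustration. It assumes each paragraph has already been reduced to its text plus a bold flag and an italic flag, and that anything before the first heading can be dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical paragraph model: text plus the two style flags the heading rule needs. */
final class Para {
    final String text;
    final boolean bold;
    final boolean italic;

    Para(String text, boolean bold, boolean italic) {
        this.text = text;
        this.bold = bold;
        this.italic = italic;
    }

    /** The rule from the question: a chapter heading is non-empty, bold, and italic. */
    boolean isChapterHeading() {
        return bold && italic && !text.trim().isEmpty();
    }
}

/** Hypothetical chapter model: a heading and the paragraphs that follow it. */
final class Chapter {
    final String title;
    final List<Para> body = new ArrayList<>();

    Chapter(String title) {
        this.title = title;
    }
}

final class ChapterSplitter {
    private ChapterSplitter() {}

    /**
     * Starts a new chapter at every bold-italic paragraph.
     * Paragraphs before the first heading are dropped (an assumption of this sketch).
     */
    static List<Chapter> split(List<Para> paragraphs) {
        List<Chapter> chapters = new ArrayList<>();
        Chapter current = null;
        for (Para p : paragraphs) {
            if (p.isChapterHeading()) {
                current = new Chapter(p.text.trim());
                chapters.add(current);
            } else if (current != null) {
                current.body.add(p);
            }
        }
        return chapters;
    }
}
```

With Apache POI, the two flags could probably be filled from `XWPFRun.isBold()` and `XWPFRun.isItalic()` for the runs of each paragraph. As far as I understand, those calls report formatting set directly on the run, so headings that get bold or italic from a named style may need extra handling.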
I have used Apache Tika to convert the .docx to HTML and extract the contents, and PDFBox to convert the HTML to PDF. However, in some cases Apache Tika does not render bullets and table layout correctly. I have also tried docx4j for converting the .docx to HTML, but special characters are not retrieved properly.
How can this scenario be achieved? Should we enhance the Apache Tika output, or is there another way to extract the contents without affecting the styling?
Tried Apache POI: cannot render the exact styles
Tried Apache Tika: bullets and tables from the .docx file are not in the right format
Tried docx4j: special characters are not rendered correctly