I use Tika to parse documents. I have defined a custom image parser that uses OCR, so pictures are parsed by OCR.
When I use Tika to parse Microsoft Word files (doc/docx), I found that the embedded pictures are placed at the end of the document.
I read the source code of AbstractOOXMLExtractor
and found that it first builds the XHTML and then handles the embedded files, so embedded content always ends up at the end.
    public void getXHTML(ContentHandler handler, Metadata metadata, ParseContext context)
            throws SAXException, XmlException, IOException, TikaException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        // first parse xhtml
        buildXHTML(xhtml);
        // then the embedded files
        // Now do any embedded parts
        handleEmbeddedParts(xhtml, metadata, getEmbeddedPartMetadataMap());
        // thumbnail
        handleThumbnail(xhtml, metadata);
        xhtml.endDocument();
    }
I found no way to customize this behavior.
As a workaround, I customized ParsingEmbeddedDocumentExtractor and ParsingEmbeddedDocumentExtractorFactory to write the parsed embedded picture to the ParseContext (not to the parsed result).
Then I parsed the docx as XML and replaced each img tag with the parsed embedded file fetched from the ParseContext.
That way, the pictures in the Word document are parsed in the right position.
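The replacement step of this workaround can be sketched roughly as below. This is only an illustration, not my actual code and not part of Tika: the class name InlineOcrMerger, the method inlineOcrText, and the span wrapper are hypothetical, and the map stands in for whatever was stored in the ParseContext. It also assumes the XHTML output marks each picture as an img tag whose src starts with "embedded:" followed by the file name, and it uses a regular expression instead of a real XML parser to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: swaps <img src="embedded:NAME" .../> placeholders
// for the OCR text that was collected for NAME.
final class InlineOcrMerger {

    // Assumption: embedded pictures appear as img tags with an "embedded:" src.
    private static final Pattern EMBEDDED_IMG =
            Pattern.compile("<img\\b[^>]*\\bsrc=\"embedded:([^\"]+)\"[^>]*/?>");

    static String inlineOcrText(String xhtml, Map<String, String> ocrByName) {
        Matcher m = EMBEDDED_IMG.matcher(xhtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ocr = ocrByName.get(m.group(1));
            // Leave the tag untouched when no OCR text is known for this picture.
            String replacement = (ocr == null)
                    ? m.group()
                    : "<span class=\"ocr\">" + escape(ocr) + "</span>";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Minimal escaping so OCR text cannot break the surrounding markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, String> ocr = new HashMap<>();
        ocr.put("image1.png", "text recognized in the picture");
        String xhtml = "<p>Before</p>"
                + "<img src=\"embedded:image1.png\" alt=\"image1.png\"/>"
                + "<p>After</p>";
        System.out.println(inlineOcrText(xhtml, ocr));
    }
}
```

The OCR text ends up where the picture was referenced, instead of after the body text. Pictures without a map entry keep their original img tag.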
Is there a better way to do this?
And why does Tika parse the embedded files at the end of the document?
Is there any problem with parsing them where they appear?