I am trying out Tika’s ability to determine whether a file is corrupted, and so far
I have not been able to trigger the exceptions I would expect when I butcher a PDF
file to the point where even Acrobat Reader can no longer repair it.
I started simple by removing the %%EOF marker from a PDF file and, to my surprise,
Tika doesn’t say a word about it being missing. I then removed larger trailing parts,
as well as pieces from the middle, and Tika keeps on parsing the file.
I am definitely missing the point of how to use Tika for corruption detection, and would
appreciate knowing what I need to add for it to start tripping over simple format errors
like a missing %%EOF.
// Requires: import java.io.InputStream; import org.apache.tika.Tika;
System.out.println("##################################################################");
System.out.println("### 1. Parsing using the Tika facade ###");
System.out.println("### Parsing the input document to plain text ###");
System.out.println("##################################################################");
Tika tika = new Tika();
// Damaged or not, the files just get parsed and no exception is thrown.
InputStream stream = TikaPOC.class.getResourceAsStream("GeneratedDamagedFile.pdf");
try {
    // parseToString declares IOException and TikaException.
    String parsedDocument = tika.parseToString(stream);
    System.out.println("Parsed document: <" + parsedDocument + ">");
} catch (Exception e) {
    System.out.println("Failed to parse document: " + e);
}
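
For comparison, the kind of check I expected to be triggered can be sketched in plain Java, independently of Tika. This is only a minimal illustration, not something Tika provides: the class name PdfEofCheck, the method hasEofMarker, and the 1024-byte tail window are assumptions of mine, and the check only looks for the trailing %%EOF marker, so it says nothing about damage elsewhere in the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: looks for the %%EOF marker near the end of a file.
final class PdfEofCheck {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumption: only the last 1024 bytes are searched for the marker.
    private static final int TAIL_WINDOW = 1024;

    private PdfEofCheck() {
    }

    // Returns true if the marker occurs anywhere in the trailing window of the data.
    static boolean hasEofMarker(byte[] data) {
        int start = Math.max(0, data.length - TAIL_WINDOW);
        // Scan backwards from the last position where the marker could still fit.
        for (int i = data.length - EOF_MARKER.length; i >= start; i--) {
            if (matchesAt(data, i)) {
                return true;
            }
        }
        return false;
    }

    // Convenience overload that reads the whole file into memory first.
    static boolean hasEofMarker(Path file) throws IOException {
        return hasEofMarker(Files.readAllBytes(file));
    }

    // Compares the marker bytes against the data starting at the given offset.
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[offset + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling PdfEofCheck.hasEofMarker on the damaged file before handing it to tika.parseToString would at least flag the missing marker, but I would much rather have the parser itself report this.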