I’m trying to integrate Tika to detect file types in a content management system.
Unfortunately, it fails to detect the CSV format.
I’ve inspected it in detail, and it seems it can detect CSV if the separator is a comma or a tab, but it fails for a semicolon (which unfortunately is the Excel ‘standard’):
for (char delimiter : new char[]{'\t', ';', ','}) {
    System.out.println("With delimiter: <" + delimiter + ">");
    // Build a 3-column CSV (header + 100 rows) using the current delimiter
    CSVFormat csvFormat = CSVFormat.DEFAULT.builder().setDelimiter(delimiter).setHeader("Name", "Value", "Count").build();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    CSVPrinter printer = new CSVPrinter(new OutputStreamWriter(baos), csvFormat);
    for (int i = 1; i <= 100; i++) {
        printer.printRecord("Name" + i, "Value" + i, i);
    }
    printer.flush();
    // Parse the generated bytes and print the metadata Tika reports
    ContentHandler handler = new BodyContentHandler();
    ParseContext context = new ParseContext();
    Metadata metadata = new Metadata();
    Parser parser = new TextAndCSVParser();
    parser.parse(new ByteArrayInputStream(baos.toByteArray()), handler, metadata, context);
    System.out.println(metadata);
}
Output:
With delimiter: < >
csv:num_rows=101 csv:num_columns=3 Content-Encoding=windows-1252 csv:delimiter=tab Content-Type=text/tsv; charset=windows-1252; delimiter=tab
With delimiter: <;>
Content-Encoding=windows-1252 Content-Type=text/plain; charset=windows-1252
With delimiter: <,>
csv:num_rows=101 csv:num_columns=3 Content-Encoding=windows-1252 csv:delimiter=comma Content-Type=text/csv; charset=windows-1252; delimiter=comma
Is this a bug, or was supporting the many possible CSV formats deemed too hard/unreliable and given up on? Is there any way to make Tika support semicolon-delimited CSV? I’m using version 2.9.2.
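In case it helps frame the question, here is the kind of pre-check I could run myself before handing the file to Tika. It is only a rough sketch and not how Tika detects delimiters: the class name DelimiterSniffer, the candidate list, and the "same non-zero count on every sampled line" rule are all my own assumptions, and quoted fields are not handled.

```java
public class DelimiterSniffer {
    // Hypothetical candidate list; the semicolon is included on purpose.
    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    /**
     * Returns the candidate that occurs the same non-zero number of times on
     * each of the first maxLines lines, preferring the one with the most
     * occurrences per line. Returns null if no candidate is consistent.
     */
    public static Character sniff(String text, int maxLines) {
        String[] lines = text.split("\r?\n");
        int limit = Math.min(lines.length, maxLines);
        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;
            boolean consistent = limit > 0;
            for (int i = 0; i < limit && consistent; i++) {
                int count = countOccurrences(lines[i], candidate);
                if (expected == -1) {
                    expected = count;
                }
                consistent = count > 0 && count == expected;
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return best;
    }

    // Counts how many times c occurs in line (no quote handling).
    private static int countOccurrences(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String sample = "Name;Value;Count\r\nName1;Value1;1\r\nName2;Value2;2\r\n";
        System.out.println("Detected delimiter: <" + sniff(sample, 10) + ">");
    }
}
```

If this returns a semicolon, I could treat the file as CSV in my own code. Whether TextAndCSVParser can be given a delimiter hint instead is something I have not confirmed for 2.9.2.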