Learner of Latin and Ancient Greek here.
Elasticsearch has a Greek stemmer which, in my experiments, works very well with Ancient Greek in my Elasticsearch-based application. Presumably that’s because after normalisation (stripping of accents) Ancient Greek isn’t that different from Modern Greek, at least for stemming and searching purposes. It proves very useful as I, a beginner, gradually create documents full of vocab notes: my application parses .docx documents and generates Lucene Documents consisting of overlapping 10-paragraph chunks, so I can quickly look up previously researched words and expressions.
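To make those two preprocessing steps concrete, here is a minimal Python sketch, not my actual application code. The function names `strip_accents` and `overlapping_chunks` are made up for illustration, and the step of 5 paragraphs between chunks is an assumption (any step smaller than the chunk size gives overlap).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (accents, breathings, iota subscript)."""
    # NFD splits each precomposed letter into a base letter plus marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so keep only the rest.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def overlapping_chunks(paragraphs, size=10, step=5):
    """Split a list of paragraphs into overlapping windows.

    size=10 matches the 10-paragraph chunks described above;
    step=5 is an assumed value, chosen only for this example.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    start = 0
    while start < len(paragraphs):
        chunks.append(paragraphs[start:start + size])
        if start + size >= len(paragraphs):
            break  # this window already reaches the last paragraph
        start += step
    return chunks
```

Something like `strip_accents("\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2")` (the word for "human being", written with a breathing and an accent on the first letter) should come back as the same letters with no diacritics. Each chunk returned by `overlapping_chunks` would then be indexed as one document.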
For Latin, Elasticsearch has no stemmer or analyser. I did a bit of searching on this and found one or two documents, mostly quite old, about how one might be developed. On GitHub I found a couple of very old projects that seem to have made a start on this.
But I couldn’t find a ready-made stemmer module that I could just get hold of and use with minimal fuss in my Elasticsearch-based application.
A stemmer is not essential because I can also do non-stemmed searches. But it’d be very helpful, and I’d also be surprised if no academics anywhere have ever developed such a thing: there is an enormous corpus of Latin texts from more than 20 centuries. Anyone got any knowledge of such a project?