Using ASP.NET, I want to implement full-text search using Lucene/Solr on a LARGE number of documents (Word, PDF, etc.) residing in a directory on a NAS drive. The NAS drive would be mapped as a network drive on the server. The list of documents changes frequently. As per my research, Lucene does not index PDF/Word docs directly: the text needs to be extracted from the docs first and then passed to the Lucene indexer. Is it advisable to use PDFBox and other third-party tools to extract the text from these binary files and pass it to the Lucene indexer? What would be the impact on Lucene search performance? Should I use Solr instead of Lucene, since it supports indexing PDF/Word docs?
Yes, Solr supports indexing PDF and Word documents out of the box (well, after a bit of configuration; see the examples from version 4.9 onwards). The thing to note is that Solr != Lucene. Solr is a higher-level abstraction over Lucene, and as such it has a different API, features and behaviour.
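To make the Solr route a bit more concrete: rich documents are normally sent to Solr's extracting request handler (Solr Cell, built on Apache Tika) at `/update/extract`, and Solr does the text extraction itself. Below is a minimal sketch in Python, not a definitive implementation. It assumes the handler is enabled in your `solrconfig.xml`; the base URL, the core name `docs` and the helper name `build_extract_request` are placeholders I made up for illustration. The same HTTP request can be issued from ASP.NET with any HTTP client.

```python
import mimetypes
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, core, doc_id, filename, payload):
    """Build (but do not send) a POST to Solr's extracting request handler.

    solr_base, core and doc_id are illustrative values supplied by the caller;
    payload is the raw bytes of the PDF/Word file.
    """
    # literal.id sets the document's unique key; commit=true makes the
    # document searchable right away (batch your commits in production).
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base.rstrip('/')}/{core}/update/extract?{params}"

    # Guess the MIME type from the file name; fall back to a generic type.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"

    return Request(
        url,
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Usage (needs a running Solr instance, so it is left commented out):
#
#   from pathlib import Path
#   from urllib.request import urlopen
#
#   path = Path(r"Z:\docs\report.pdf")   # hypothetical file on the mapped NAS drive
#   req = build_extract_request("http://localhost:8983/solr", "docs",
#                               str(path), path.name, path.read_bytes())
#   with urlopen(req) as resp:
#       print(resp.status)
```

Since your list of documents changes frequently, you would call something like this for each new or modified file (for example, by comparing modification times against the last indexing run) and commit in batches rather than per document.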
IMHO, the difference between using Solr and using Lucene can be summarized as follows: Solr needs less configuration/set-up and makes for a quicker implementation, but will require more resources to run than Lucene. To detail: Solr comes with a REST API, a ton of caches and support for fancy features such as clustering. Some of these are enabled by default. As such, Solr will need more memory to run properly, and possibly more CPU. You should take all of this into account when configuring it; otherwise, a trivial Solr implementation will give you what seems to be the same behaviour as the Lucene implementation while requiring a lot more resources for similar performance.