I am designing a search engine for an ASP.NET site. The site currently uses Microsoft Indexing Server (MIS) to index and search content that ranges from simple text files to MS Office documents and PDFs. MIS is also used to crawl file servers and, in tandem with Index Server Companion, to crawl content from external sites. I intend to replace MIS with the indexer/crawler I am building.
The motivation for moving away from MIS is:
1. Microsoft is discontinuing MIS support in its upcoming Server 2012 releases.
2. Major PaaS providers do not support MIS.
Since my platform is entirely on the Microsoft stack, I can't afford (because of deployment and maintenance issues rather than cost) to run a Java application server. Thus Solr, and effectively SolrNet, is ruled out.
With that as the context, I have a couple of questions.
1. Technology choice
I did an initial investigation and looked at Lucene.Net. There seem to be two issues with using it. First, it can't crawl external content, and there doesn’t seem to be a direct port of Nutch to .NET. Second, since it is just an indexer, it can't parse the various document types; the parsing is left to the developer.
So, what would be the best technology choice on the .NET platform for indexing and crawling? Are there any open source .NET libraries available for document parsing?
2. Architectural pattern
Is there any general architectural pattern or best practice that needs to be followed in designing such a search engine?
- This link might give you some answers on how to parse various file formats, especially the link that shows how to use Tika from .NET with IKVM.
- I develop applications on the Microsoft stack, and I can tell you that running a Tomcat server with a Solr instance is affordable. The only reason not to use Solr that would seem valid to me is that the hosting company cannot offer this kind of service, but even then, nothing prevents you from running Tomcat on a separate machine. I would rather spend money on that alternative than invest in developing yet another search engine. If you still want to develop your own search engine, I would recommend starting with the Solr documentation. There are tons of useful things there.
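If you do end up building it yourself, the usual split is: fetch a document, pick a parser based on its type, then feed the extracted text to an indexer. Below is a minimal sketch of that split, written in Python only for brevity; the same shape ports to C#. Every name in it (`PARSERS`, `InvertedIndex`, `index_document`) is hypothetical, the HTML handling is a crude placeholder for a real parser such as Tika, and it is not meant to describe how Lucene.Net or Solr work internally.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Set

# Hypothetical parser type: turns the raw bytes of one document into plain text.
Parser = Callable[[bytes], str]


def parse_plain_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


def parse_html(raw: bytes) -> str:
    # Crude tag stripping, for illustration only; a real parser belongs here.
    return re.sub(r"<[^>]+>", " ", raw.decode("utf-8", errors="replace"))


# Hypothetical registry: file extension -> parser. PDF or Office parsers
# would be registered here in the same way.
PARSERS: Dict[str, Parser] = {".txt": parse_plain_text, ".html": parse_html}


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> Set[str]:
        # AND semantics: a document must contain every query term.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


def index_document(index: InvertedIndex, doc_id: str, extension: str, raw: bytes) -> bool:
    """Parse and index one document; return False if no parser is registered."""
    parser = PARSERS.get(extension.lower())
    if parser is None:
        return False
    index.add(doc_id, parser(raw))
    return True
```

A crawler would call `index_document` once per fetched file or page, so the crawling, parsing, and indexing pieces can each be replaced independently. Ranking, stemming, incremental updates, and persistence are all missing here, and those are the parts where an existing engine saves the most effort.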