For one of the projects at uni we were given the task of creating a custom, niche search engine. My colleagues and I split the tasks among ourselves so that we could tackle the overall project more easily. My part is to create the indexer. I have already read the Wikipedia page on search engine indexers and some other related articles, but I'm still struggling to understand exactly how it works and what it looks like.
To me it is obvious that it is not just a regular table with an index and a description column. So my question is: what is a search engine indexer composed of, what does its architecture look like, and where should I start in building one?
At its core, a search engine index is simply an index that supports full-text search. The simplest way to do that is an inverted index: for each word that occurs in any of the documents you have indexed, store a list of references to all the documents that contain that word.
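A minimal sketch of that idea in Python (the documents and their IDs here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical corpus: document ID -> text
docs = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "quick brown dogs are lazy",
}

# Inverted index: word -> set of IDs of documents containing that word
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(sorted(index["quick"]))  # -> [0, 2]
```

A real indexer would also normalize words (stemming, stopword removal) and store positions or frequencies, but the word-to-postings-list mapping is the essential structure.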
For a university project, that's probably enough, but of course there's infinite room for improvement. You can combine multiple search words using AND and OR logic, and assign each document a weight depending on where and how often a word appears in it. That's roughly the state of WWW search engines circa 1998, before Google revolutionized the field with its PageRank algorithm. Since then, they've had hundreds (if not thousands) of people working on improving it continuously.
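With an inverted index, AND and OR queries fall out naturally as set intersection and union over the posting lists. A self-contained sketch, with a small hard-coded index standing in for one you'd build from real documents:

```python
# Hypothetical inverted index: word -> set of document IDs
index = {
    "quick": {0, 2},
    "brown": {0, 2},
    "lazy":  {1, 2},
    "dog":   {1},
}

def search_and(index, words):
    """IDs of documents containing every query word (AND)."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, words):
    """IDs of documents containing at least one query word (OR)."""
    return set().union(*(index.get(w, set()) for w in words))

print(search_and(index, ["quick", "lazy"]))  # -> {2}
print(search_or(index, ["dog", "brown"]))    # -> {0, 1, 2}
```

Ranking by weight would then sort the resulting IDs by a per-document score (e.g. term frequency), rather than returning them unordered.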
Additionally, to support an index for the entire WWW (or even a small part of it), you need a distributed architecture, something like MapReduce.
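To illustrate how index construction maps onto the MapReduce model, here is a single-process simulation: the map phase emits (word, doc_id) pairs, and the reduce phase groups them by word into posting lists. This is only a local sketch of the idea, not a distributed implementation; the corpus is invented.

```python
from itertools import groupby

# Hypothetical corpus: document ID -> text
docs = {0: "quick brown fox", 1: "lazy brown dog"}

# Map phase: emit one (word, doc_id) pair per word occurrence
pairs = [(word, doc_id) for doc_id, text in docs.items()
         for word in text.split()]

# Shuffle + reduce phase: sort pairs by word, then group each word's
# doc IDs into a posting list
pairs.sort()
index = {word: sorted({doc_id for _, doc_id in group})
         for word, group in groupby(pairs, key=lambda p: p[0])}

print(index["brown"])  # -> [0, 1]
```

In a real MapReduce job, the map and reduce steps would run on different machines, with the framework handling the sort-and-shuffle between them.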