I've started building a web crawler and read somewhere that finding a good seed page for the crawler is a very hard problem. Can anyone explain whether there is a predefined procedure or set of guidelines for finding a good seed page? Or how do you decide that a particular page is a good seed page?
2
A good seed page needs to have
- As many links as possible
- For as many different topics as possible
That’s all, really. The first things that come to mind would be Wikipedia and the Open Directory Project.
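The two criteria above can be checked roughly on a candidate page you have already downloaded. This is only a minimal sketch: `seed_stats` and `LinkCollector` are hypothetical names, and counting distinct hosts is an assumed stand-in for "different topics", not a real topic measure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects unique absolute http(s) URLs found in <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page's own URL.
        url = urljoin(self.base_url, href)
        # Skip mailto:, javascript:, and other non-web schemes.
        if urlparse(url).scheme in ("http", "https"):
            self.links.add(url)


def seed_stats(html, base_url):
    """Return (unique link count, distinct host count) for one page.

    Assumption: distinct hosts only approximate topic diversity.
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    hosts = {urlparse(url).netloc.lower() for url in collector.links}
    return len(collector.links), len(hosts)
```

Comparing the two numbers across a few candidate pages gives a crude ranking: a page with many links that all point to one host is probably a weaker seed than one whose links spread over many hosts.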
1
Since I believe general-purpose web search should now be considered a solved problem, with a single provider that has such a head start and such momentum that it's impractical, inconvenient, and kind of lame to even consider outrunning it, I think you might prefer to focus your search efforts on a specialized domain.
e.g.: U.S. law, E.U. law, local laws of certain foreign countries,
documents in a certain language, fishing, hunting, the military,
environmentalism, Indian car manufacturers, South American team games,
Russian motorsports…
Specializing in a domain is all about finding the appropriate seeds for that domain. (Also, IMHO, a handful of selected, little-known, specialized forums should be considered as a starting point.) So once you have your domain, if you know it well (or at least the person you're working with or for does), the seeds should come naturally.
Otherwise, head for dmoz.org, like Google did some 15 years ago. (And don't forget to factor in yahoo.com for good measure.) Give your crawler another 15 years and you should have a good link base to search.
OK, OK, if you really, really must take this hard and pointless route,
also consider: reddit.com, twitter.com, and (why not) the
stackexchange.com constellation.
2