I’m trying to understand how something like this works, but I’m inexperienced, so I’d just like a general picture of the process.
Say you have ten categories, a thousand zip codes, and ten thousand items listed in each category, each of which has a description.
When a user does a search for (e.g.) “guitar hero xbox” in the “video games” category within “50 miles”, what happens next?
Would the database compare zip codes first to find everything in range, then do a SELECT items WHERE category="video games" AND zipcode IN (A, B, C, D), then run some sort of search on the description of each listing in the result set?
How would that work?
I’m not looking for exact answers about Craigslist specifically – I’m looking for a general idea of how this works anywhere/everywhere.
I wouldn’t worry too much about how Craigslist (or any large site) does it, because their insane performance requirements heavily dictate architectural decisions.
Solving this in a simple way would result in a query like this:
# pseudo query
SELECT fields
FROM ads
WHERE ads.title MATCHES "search keywords"
AND ads.location IS IN PROXIMITY TO "search location"
Now, both searching by keyword and searching by distance can be complex tasks in themselves. Typically, filtering by distance can be done much faster than filtering by words, so it’s best to get that out of the way first.
That can be done with some relatively easy math if you have coordinates for your search location as well as for each entry in your database. You’d implement something like this – http://www.movable-type.co.uk/scripts/latlong.html. There are also more appropriate geospatial options (spatial indexes, for example) if your database supports them.
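As a rough sketch of the great-circle math from that link, here’s a minimal Python version of the haversine formula (the 6371 km Earth radius is an approximation, and the function name is just mine):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km (approximation)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# One degree of longitude at the equator is roughly 111 km:
print(round(haversine_km(0, 0, 0, 1), 1))  # → 111.2
```

In practice you’d compute this (or let a spatial index do it) to pre-filter listings to those within, say, 50 miles, before doing any keyword matching.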
Filtering by keyword usually starts with something trivial like WHERE title LIKE '%search%' OR body LIKE '%search%', but that’s insanely slow and lacks the ability to sort by relevance.
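You can see why it’s slow directly in, say, SQLite (used here as a stand-in for whatever database you have; the table and index names are invented): a leading wildcard means no B-tree index can help, so the query plan degrades to a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (title TEXT, body TEXT)")
conn.execute("CREATE INDEX idx_ads_title ON ads (title)")

# The leading '%' wildcard cannot use the index on title,
# so SQLite falls back to scanning every row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ads WHERE title LIKE '%guitar%'"
).fetchall()
print(plan[0][3])  # e.g. "SCAN ads" – a full scan, not an index search
```

With millions of listings, scanning every description on every search simply doesn’t scale.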
Next, you move on to some sort of fulltext search – with MySQL, MyISAM tables provide the MATCH ... AGAINST syntax, which also gives you a relevance score.
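The same idea can be sketched with SQLite’s FTS5 extension (used here as a stand-in for MySQL’s fulltext search; the sample listings are invented, and your Python build needs FTS5 compiled in). An inverted index makes MATCH fast, and bm25() provides a relevance score:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # requires an SQLite build with FTS5
conn.execute("CREATE VIRTUAL TABLE ads USING fts5(title, body)")
conn.executemany(
    "INSERT INTO ads VALUES (?, ?)",
    [
        ("Guitar Hero III for Xbox 360", "Complete with wireless guitar controller"),
        ("Acoustic guitar", "Barely used, great tone"),
        ("Xbox 360 console", "Two controllers included"),
    ],
)

# Space-separated terms are AND-ed together; bm25() returns negative
# scores where lower means more relevant, so ascending order is best-first.
rows = conn.execute(
    "SELECT title FROM ads WHERE ads MATCH ? ORDER BY bm25(ads)",
    ("guitar xbox",),
).fetchall()
print(rows)  # only the listing containing both terms matches
```

Unlike the LIKE scan, this consults an index of terms rather than reading every row, which is the core trick behind all the dedicated search engines mentioned below.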
Beyond this, people usually move to some sort of dedicated search program. Sphinx, Lucene and Elasticsearch come to mind.
Looking back at Craigslist, they actually use Sphinx Search and have quite a complex infrastructure that probably won’t be very helpful to you. Last time I checked, they were the largest user of Sphinx, so their experience is quite instructive for other large users of it.
Many performant search products (proprietary or otherwise) use Lucene as a basis. Other popular choices for “Big Data” include various forms of NoSQL. Another approach (although this tends to be a bit slower) is full-text searching in a regular SQL database.
This is how some searches similar to Craigslist’s work – though I have no idea what Craigslist actually uses to power its search.