I’m working on a review site, where there is a minor issue with almost duplicate reviews across items. Just a few words are changed. It would be very nice to be able to uncover these duplicates before they are approved by a moderator, and I’m hoping someone could chime in on the best strategy to get there.
The site is running Ruby on Rails on a Postgres database and using Thinking Sphinx for search (all on Heroku), and so far the best option I see is to be pulling all the reviews out of the db and using a module like amatch to compare the strings. Not very efficient, so in this case I guess I’ll have to limit the number/age of reviews to scan for dupes.
Anyone got a better idea?
3
You could put together a SQL query using its own string matching functions: http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html
You’d need to join the table against itself (or take a cross product by doing SELECT ... FROM table AS t1, table AS t2 ...
) and then filter based on the string conditions that you want.
But, at the end of the day, Postgres will need to do table-scans in order to produce the results, so do be careful if you have a large database. Either limit the scope, chunk up the query, or run it at an off-peak time.
1