I have a database table holding 40 million records (table A).
Each record has a string a user can search for.
I also have a table with a list of character replacements (table B) i.e. i = Y, I = 1 etc.
I need to be able to take the string a user is searching for, iterate through each letter and create an array of every possible outcome (the users string, then each outcome with alternative letters used).
I need to check for alternatives on both lower and uppercase letters in the word
A search string can be no longer than 10 characters long.
I’m using PHP and a MySQL database.
Does anyone have any thoughts / articles / guidance on doing this in an efficient way?
4
I have a database table holding 40 million records (table A).
It sounds like an RDBMS is not appropriate for this task.
As @Jim Arnold says, you should use either Lucene or Endeca for this task.
Here is what I see possible based on my understanding:
One way is to create a column (preferably on table A) and have a database trigger (or write code in your language) such that whenever a row is inserted or update, a new column (shown below as “SearchColumn”) would get populated with ONLY ONE STANDARD FORMAT of the string value. In the example blow I have arbitrary chosen upper case.
The seach column need to have a non-unique index.
When the user performs a search for say,’yes’, you do:
SELECT FROM myTable WHERE SearchColumn=”YES” (see the last paragraph regarding the index cases sensitivity).
myTable would include the following columns (amongst others of course):
OriginalColumn, SearchColumn
YES, YES
Y3S, YES
Y55, YSS
yes, YES
NO, NO
N0, NO
…
By doing this, you don’t need to perform non-indexed row searching. This method, since it uses an index will be very fast. The overhead of processing the extra column should not be significant, unless you have simultaneous thousands of concurrent inserts/updates.
In MySQL (and some other database), you can control whether an index uses case sensitivity or not via the column collating properties definition.
2