I have two lists of data, A and B. Each list is itself aggregated from multiple sources, so the values contain typos and abbreviations that don't appear in the other list. There is no complete 1-to-1 mapping between the lists, but a value in A will never map to two values in B, and vice versa.
Right now, we're doing a naive exact string comparison to build a map between the two lists, which gets about 80% accuracy. I'd like to get that accuracy to at least 90% (95% would be incredible).
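For concreteness, the current pass is essentially just this (a simplified sketch with toy data; the real lists come from our aggregation pipeline):

```python
A = ["KING ROAD", "MAIN ST", "OAK AVENUE"]  # toy data standing in for ~50k rows
B = ["KG RD", "MAIN ST", "OAK AVE"]

# Naive pass: a hit only when the strings are byte-for-byte identical.
b_set = set(B)
matched = {a: a for a in A if a in b_set}       # only "MAIN ST" hits here
unmatched = [a for a in A if a not in matched]  # the ~20% we want to recover
```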
Are there any software tools that can be used for something like this? I’d like some sort of tool that could traverse both lists and suggest matches.
Update from comments:
Right now, we only produce a hit if `A[x] == B[y]`. That gives us matches for 80% of the data (the data sets contain roughly fifty thousand rows each). What I'd like is a tool, or an algorithm I could build one on, that suggests a match between two values that likely have the same meaning, e.g. `KING ROAD` and `KG RD`. These potential matches would then be shown to a human to approve or ignore. Normally I'd reach for something like Levenshtein distance, but this is somewhat structured data (think addresses), and I don't know how to apply something like Levenshtein to structured, multi-token values.
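To make the question concrete, here is a rough sketch of the kind of suggestion pass I'm imagining. It uses `difflib` from the Python standard library in place of a dedicated Levenshtein package, compares token-by-token after expanding abbreviations, and emits candidate pairs above a threshold for human review. The abbreviation table and threshold are placeholders I made up, not a tested solution:

```python
import difflib

# Hypothetical abbreviation table; the real one would be mined from our data.
ABBREV = {"KG": "KING", "RD": "ROAD", "ST": "STREET", "AVE": "AVENUE"}

def normalize(value: str) -> list[str]:
    """Upper-case, split into tokens, and expand known abbreviations."""
    return [ABBREV.get(tok, tok) for tok in value.upper().split()]

def similarity(a: str, b: str) -> float:
    """Average per-token similarity after normalization (0.0 to 1.0)."""
    ta, tb = normalize(a), normalize(b)
    if len(ta) != len(tb):
        # Fall back to whole-string similarity when token counts differ.
        return difflib.SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    scores = [difflib.SequenceMatcher(None, x, y).ratio() for x, y in zip(ta, tb)]
    return sum(scores) / len(scores)

def suggest(unmatched_a, unmatched_b, threshold=0.8):
    """Yield (a, b, score) candidate pairs worth showing to a reviewer."""
    for a in unmatched_a:
        scored = [(similarity(a, b), b) for b in unmatched_b]
        if scored:
            score, best = max(scored)
            if score >= threshold:
                yield a, best, score

for a, b, score in suggest(["KING ROAD"], ["KG RD", "OAK AVE"]):
    print(f"{a!r} ~ {b!r} ({score:.2f})")  # KING ROAD ~ KG RD (1.00)
```

The idea behind normalizing before comparing is that `KG RD` expands to the same token sequence as `KING ROAD`, so the structured nature of the data helps rather than hurts; whether that generalizes to the rest of my data is exactly what I'm unsure about.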