Suppose I have the following strings:
string1 = "home/lisa/Music/some_files/01.05 - Garden Ground.mp3"
string2 = "Music/Jim Smith/Unknown/(Deluxe Version/Garden Ground).mp3"
string4 = "Music/Jim Smith/Unknown/00 - Garden Ground.mp3"
Basically, I want to know if string 2 refers to the same mp3 file as string 1. In the above example, you can see this is the case with Garden Ground.mp3. What’s the best way to go about this? Should I try Levenshtein distance? I thought about using a regular expression, but there’s no guarantee string 2 will have parentheses in the exact same place every time. In fact, string 2 could just as well look like string 4.
I don’t really think any existing algorithm is going to work here. You are not looking for similarity between whole strings, but for whether the strings contain similar values. Levenshtein distance measures how similar two strings are as a whole, not whether they contain similar values.
The simplest algorithm would be (pseudocode):
- Get the name of each file without its directory (I’m assuming the directory doesn’t matter)
- Split each name into words based on predefined separators (‘ ’, ‘-’, ‘;’, ‘.’, etc.; you will have to do some trial and error here to find the right separators)
- Count how many words the two names have in common (e.g., in your case each pair of file names shares 2 words)
- If the count of shared words exceeds some threshold, consider the files the same
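The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a definitive implementation: the separator set, the function names, the lowercasing, and the choice to drop the file extension before splitting are all assumptions made for the example.

```python
import re
from pathlib import PurePosixPath

# Assumed separator set (whitespace, '-', ';', '.', '_', brackets).
# The answer leaves the exact set to trial and error.
SEPARATORS = r"[\s\-;._()\[\]]+"

def name_words(path):
    """Return the lowercase words of the file name, ignoring directories.

    Assumption: the extension is dropped so that "mp3" does not count
    as a shared word.
    """
    stem = PurePosixPath(path).stem
    return {w.lower() for w in re.split(SEPARATORS, stem) if w}

def shared_word_count(path_a, path_b):
    """Count the words that appear in both file names."""
    return len(name_words(path_a) & name_words(path_b))

def same_file(path_a, path_b, threshold=2):
    """Absolute threshold: treat the files as the same if enough words match."""
    return shared_word_count(path_a, path_b) >= threshold

string1 = "home/lisa/Music/some_files/01.05 - Garden Ground.mp3"
string2 = "Music/Jim Smith/Unknown/(Deluxe Version/Garden Ground).mp3"
string4 = "Music/Jim Smith/Unknown/00 - Garden Ground.mp3"

print(shared_word_count(string1, string2))  # 2 ("garden", "ground")
print(same_file(string1, string4))          # True
```

With this separator set, the stray ‘)’ in string 2 is treated as a separator, so its position in the name no longer matters.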
The last step is obviously the hardest one. Figuring out a threshold that minimizes both false positives and false negatives is quite a challenge. The simplest option is an absolute threshold (e.g., count >= 2 would match your case), but it might not be enough. A relative threshold (e.g., count / totalWordsCount > 50%) might work, but it is harder to test and depends heavily on which separators you pick.
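A relative threshold could look something like the sketch below. The answer does not say what totalWordsCount is measured against; this example assumes it is the word count of the shorter file name, which is only one possible choice. The separator set and function names are likewise hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed separator set; changing it changes the ratios below.
SEPARATORS = r"[\s\-;._()\[\]]+"

def name_words(path):
    """Lowercase words of the file name, without directories or extension."""
    stem = PurePosixPath(path).stem
    return {w.lower() for w in re.split(SEPARATORS, stem) if w}

def overlap_ratio(path_a, path_b):
    """Shared words divided by the word count of the shorter name (assumption)."""
    a, b = name_words(path_a), name_words(path_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # a name with no words cannot match anything
    return len(a & b) / smaller

def same_file_relative(path_a, path_b, min_ratio=0.5):
    """Relative threshold: match when the ratio is strictly above min_ratio."""
    return overlap_ratio(path_a, path_b) > min_ratio

string1 = "home/lisa/Music/some_files/01.05 - Garden Ground.mp3"
string2 = "Music/Jim Smith/Unknown/(Deluxe Version/Garden Ground).mp3"
string4 = "Music/Jim Smith/Unknown/00 - Garden Ground.mp3"

print(overlap_ratio(string1, string2))       # 1.0 (2 shared / 2 words)
print(same_file_relative(string1, string4))  # True (2 shared / 3 words)
```

Dividing by the total number of distinct words across both names instead would give 2/4 and 2/5 for these pairs, both failing a "> 50%" test, which shows how sensitive the relative threshold is to such choices.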