I want to do a fuzzy match between two dfs. I need to use sth fast, like rapidfuzz, as there is a lot of data. The first df has a list of name with unique IDs and other data, e.g.:
ID Name A B
0 12445 John Smith a b
1 23455 Jack Smith a b
2 93854 David Brown x y
3 98736 Jason Smitdt a b
df1 = {'ID': {0: 12445, 1: 23455, 2: 93854, 3: 98736},
'Name': {0: 'John Smith',
1: 'Jack Smith',
2: 'David Brown',
3: 'Jason Smitdt'},
'A': {0: 'a', 1: 'a', 2: 'x', 3: 'a'},
'B': {0: 'b', 1: 'b', 2: 'y', 3: 'b'}}
The other df has a list of name with not unique ids e.g.
ID Name A B
0 12445 John Smith a b
1 12445 Johny Smith a b
2 23455 Jack Smith a b
3 93854 David Brown x y
4 98736 Jason Smitdt w NaN
5 98736 Jesse Smith w NaN
df2 = {'ID': {0: 12445, 1: 12445, 2: 23455, 3: 93854, 4: 98736, 5: 98736},
'Name': {0: 'John Smith',
1: 'Johny Smith',
2: 'Jack Smith',
3: 'David Brown',
4: 'Jason Smitdt',
5: 'Jesse Smith'},
'A': {0: 'a', 1: 'a', 2: 'a', 3: 'x', 4: 'w', 5: 'w'},
'B': {0: 'b', 1: 'b', 2: 'b', 3: 'y', 4: nan, 5: nan}}
I want match df1 against df2 and retrieve all the potential matches higher than let’s say 40%. I need to be able to modify ratio. I need to have each potential match in a separate row with all the additional information. If it matched against the name with the same ID, it is to be ignored. If it matched more than once to the names under same ID, need to retrieve only one result (highest ratio). So at the end it should look sth like this: