I have two data frames with two columns in common. I want to flag records in the first data frame that have a two-column match in the second, where the match on the second column is "fuzzy". Neither column is a unique key; values can be repeated in both. Imagine, for example, that my matching columns are a city column and a date column.
INPUT DATAFRAME 1 (df1):
| city | date |
| ---- | -------- |
| New York | 1/1/2024 |
| New York | 1/5/2024 |
| Chicago | 1/2/2024 |
| Chicago | 1/5/2024 |
| Houston | 1/3/2024 |
INPUT DATAFRAME 2 (df2):
| city | date |
| ---- | -------- |
| New York | 1/2/2024 |
| Chicago | 1/5/2024 |
My desired output is to flag as `True` every record in df1 that has a record in df2 that exactly matches the city AND matches within one day of the date, with all other records flagged as `False`.
OUTPUT DATAFRAME 1:
| city | date | paired |
| ---- | -------- | ------ |
| New York | 1/1/2024 | True |
| New York | 1/5/2024 | False |
| Chicago | 1/2/2024 | False |
| Chicago | 1/5/2024 | True |
| Houston | 1/3/2024 | False |
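For reference, here are the inputs as a reproducible pandas setup (assuming the date column should be parsed as datetimes rather than kept as strings):

```python
import pandas as pd

# Reproducible inputs; dates are parsed so they can be compared arithmetically
df1 = pd.DataFrame({
    "city": ["New York", "New York", "Chicago", "Chicago", "Houston"],
    "date": pd.to_datetime(["1/1/2024", "1/5/2024", "1/2/2024", "1/5/2024", "1/3/2024"]),
})
df2 = pd.DataFrame({
    "city": ["New York", "Chicago"],
    "date": pd.to_datetime(["1/2/2024", "1/5/2024"]),
})
```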
If I were seeking exact matches on both columns, the problem would be very straightforward: I would zip the two matching columns together and apply an `isin` test inside a call to `np.where`:
```python
import numpy as np

# Pair up the matching columns, then test df1's pairs for membership in df2's
df1['zipcol'] = list(zip(df1.city, df1.date))
df2['zipcol'] = list(zip(df2.city, df2.date))
df1['paired'] = np.where(df1['zipcol'].isin(df2['zipcol']), True, False)
# ...plus some overhead to drop the zipped helper columns afterwards
```
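(As an aside, a zip-free formulation of the same exact-match test, sketched here only in case it helps frame answers, is a left merge with `indicator=True`; deduplicating df2 keeps the row counts aligned.)

```python
# Sketch of an equivalent exact-match test without the helper column:
# indicator=True adds a _merge column marking which df1 rows found a partner.
merged = df1.merge(df2.drop_duplicates(), on=["city", "date"],
                   how="left", indicator=True)
df1["paired"] = (merged["_merge"] == "both").to_numpy()
```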
What is the most efficient way to perform this kind of comparison when one (or more) of the criteria is not exact, but is instead defined as the value from the second data frame falling within a range centered on the value from the first?
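For concreteness, here is a brute-force sketch that produces the desired output above, starting from the clean df1/df2 setup: it pairs each df1 row with every df2 row for the same city via a many-to-many merge and tests the date gap. The question is whether there is something more efficient than this. The one-day tolerance is hard-coded just for this example.

```python
# Brute-force baseline: many-to-many merge on city, test the date gap,
# then collapse back to one True/False flag per df1 row.
tol = pd.Timedelta(days=1)
cand = df1.reset_index().merge(df2, on="city", suffixes=("", "_2"))
hits = cand.loc[(cand["date"] - cand["date_2"]).abs() <= tol, "index"]
df1["paired"] = df1.index.isin(hits)
```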
Final note: I am not worried about the possibility of duplicate matches here. If one-to-many matches do occur, they can be handled in a post-processing step later on.