This is my code:
import pandas as pd
import time

st = time.time()

# Determine the smaller DataFrame
if len(purp_expanded) <= len(demog_expanded):
    smaller_df = purp_expanded
    larger_df = demog_expanded
else:
    smaller_df = demog_expanded
    larger_df = purp_expanded

# New DataFrame to store the merged results
merged_df = pd.DataFrame()

# Iterate through the smaller DataFrame
for index, row in smaller_df.iterrows():
    # Find matching rows in the larger DataFrame
    matching_rows = larger_df[
        (larger_df['start_date'] == row['start_date']) &
        (larger_df['end_date'] == row['end_date']) &
        (larger_df['start_msoa'] == row['start_msoa']) &
        (larger_df['end_msoa'] == row['end_msoa']) &
        (larger_df['hour_part'] == row['hour_part'])
    ]
    if not matching_rows.empty:
        # Select the first matching row
        match_row = matching_rows.iloc[0]
        # Combine the rows from the smaller and larger DataFrame
        combined_row = pd.concat([row, match_row], axis=0)
        combined_row = combined_row.to_frame().T  # Convert Series to DataFrame
        # Add the combined row to the merged DataFrame using pd.concat
        merged_df = pd.concat([merged_df, combined_row], ignore_index=True)
        # Drop the selected matching row from the larger DataFrame
        larger_df = larger_df.drop(match_row.name)
        # Drop the current row from the smaller DataFrame
        smaller_df = smaller_df.drop(index)

merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()].copy()

et = time.time()
tdiff = et - st
print(f"time elapsed: {tdiff} seconds")
This code merges two DataFrames by matching rows on five key columns. When a row is matched, the combined row is appended to a new, initially empty DataFrame, and the matched rows are dropped from both original DataFrames so each row can be used at most once. This operation is quite slow, presumably because of the row-by-row loop, and I was wondering if there is a way to speed it up.
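For context, the loop's "match once, then drop" semantics can usually be reproduced with a single vectorized `pd.merge` by numbering duplicate key combinations on each side with `groupby(...).cumcount()` and merging on the keys plus that counter, so the i-th duplicate on one side pairs with the i-th duplicate on the other. A minimal sketch, using toy stand-ins for `purp_expanded` and `demog_expanded` (the `purpose` and `age_band` columns are hypothetical):

```python
import pandas as pd

keys = ['start_date', 'end_date', 'start_msoa', 'end_msoa', 'hour_part']

# Toy stand-ins for purp_expanded and demog_expanded (hypothetical data)
purp_expanded = pd.DataFrame({
    'start_date': ['d1', 'd1', 'd2'],
    'end_date':   ['d1', 'd1', 'd2'],
    'start_msoa': ['a', 'a', 'b'],
    'end_msoa':   ['c', 'c', 'd'],
    'hour_part':  [1, 1, 2],
    'purpose':    ['work', 'shop', 'home'],
})
demog_expanded = pd.DataFrame({
    'start_date': ['d1', 'd1', 'd2'],
    'end_date':   ['d1', 'd1', 'd2'],
    'start_msoa': ['a', 'a', 'b'],
    'end_msoa':   ['c', 'c', 'd'],
    'hour_part':  [1, 1, 2],
    'age_band':   ['16-24', '25-34', '35-44'],
})

# Number repeated key combinations on each side; merging on keys + counter
# means every row pairs with at most one partner, mirroring the
# "drop the matched row" behaviour of the loop
purp = purp_expanded.assign(_n=purp_expanded.groupby(keys).cumcount())
demog = demog_expanded.assign(_n=demog_expanded.groupby(keys).cumcount())

merged_df = purp.merge(demog, on=keys + ['_n']).drop(columns='_n')
print(merged_df)
```

Note this pairs duplicates in row order on both sides, which matches the loop only when the "first matching row" would be consumed in the same order; if that ordering matters differently in your data, the results can differ.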