I have a table with polygons and district names; I also have purchase data with exact longitude and latitude. I wrote a function that, for every coordinate pair, checks which polygon contains it and then assigns that district name to the purchase. The problem is that it is very slow due to the nested for-loops and lack of vectorization (thanks, pandas). How can I optimize it so it can digest 10+ million rows in reasonable time?
import pandas as pd
import shapely.geometry
from tqdm import tqdm

def get_district_name(geo_df: pd.DataFrame, ship_df: pd.DataFrame, col_name: str, frac: float = 0.65) -> pd.DataFrame:
    sample_ship = ship_df.sample(frac=frac, replace=False, random_state=42).reset_index(drop=True)
    sample_ship['municipal_district_name'] = ''
    for i in tqdm(range(len(sample_ship))):
        point = shapely.geometry.Point(sample_ship['address_longitude'][i], sample_ship['address_latitude'][i])
        for j in range(len(geo_df)):
            if point.within(geo_df.geometry[j]):
                # .loc avoids chained-assignment issues; break stops scanning
                # polygons once a match is found
                sample_ship.loc[i, 'municipal_district_name'] = geo_df[col_name][j]
                break
    return sample_ship
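For reference, the usual way to vectorize point-in-polygon lookups is a spatial join with GeoPandas: `gpd.sjoin` builds a spatial index over the polygons, so it avoids the O(points × polygons) nested loop. Below is a minimal sketch assuming the column names from the question; the two toy polygons and coordinates are made up purely for illustration.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy districts (hypothetical data, stand-in for the real geo_df).
geo_df = gpd.GeoDataFrame(
    {"municipal_district_name": ["North", "South"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(0, -1), (1, -1), (1, 0), (0, 0)]),
    ],
)

# Toy purchases with exact coordinates (stand-in for ship_df).
ship_df = pd.DataFrame(
    {"address_longitude": [0.5, 0.5], "address_latitude": [0.5, -0.5]}
)

# Build all point geometries in one vectorized call,
# then join each point to the polygon that contains it.
points = gpd.GeoDataFrame(
    ship_df,
    geometry=gpd.points_from_xy(
        ship_df["address_longitude"], ship_df["address_latitude"]
    ),
    crs=geo_df.crs,
)
joined = gpd.sjoin(points, geo_df, how="left", predicate="within")
print(joined["municipal_district_name"].tolist())
```

The result keeps every purchase row (`how="left"`), with the district name filled in where a containing polygon exists and NaN otherwise; you can drop the `index_right` helper column afterwards. For 10+ million rows this should be dramatically faster than the double loop, and if memory becomes an issue the points can be joined in chunks.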