I have a function, simplified below, that I use to create a new column in an existing DataFrame. Right now I am using iterrows. I tried using .apply while filtering by the unique (Year, Week) groups (no, these cannot be converted to dates, for reasons), but it did not speed things up much. Any ideas for accomplishing this more efficiently with vectorized functions?
Simplified sample versions:
input_df =

| Year | Week | Company |
|------|------|---------|
| 2024 | 4    | A       |
| 2023 | 18   | A       |
| 2023 | 10   | B       |
historical_df =

| Year | Week | Company | Volume |
|------|------|---------|--------|
| 2024 | 2    | A       | 43     |
| 2023 | 16   | A       | 32     |
| 2023 | 12   | B       | 67     |
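For copy/paste, the sample frames above can be built like this (the real frames are much larger; these are just the simplified samples):

```python
import pandas as pd

input_df = pd.DataFrame({
    'Year': [2024, 2023, 2023],
    'Week': [4, 18, 10],
    'Company': ['A', 'A', 'B'],
})

historical_df = pd.DataFrame({
    'Year': [2024, 2023, 2023],
    'Week': [2, 16, 12],
    'Company': ['A', 'A', 'B'],
    'Volume': [43, 32, 67],
})
```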
Logic:

For each row in input_df:

- Filter historical_df on Year, Week, and Company so the calculation only uses that company's historical data from before that row's Year and Week.
- Sort the filtered history so that .head() can pull the 16 most recent weeks of data (only up to, not including, the current row's Year and Week).
- Use those (up to) 16 weeks of Volume data to calculate the coefficient of variation (std dev / mean).
- Store that value in a list.
- Use the list to create the new column in input_df.
```python
def add_variability(input_df, historical_df):
    variability_calcs = []
    for index, row in input_df.iterrows():
        # Keep only this company's history strictly before the row's Year/Week.
        filtered_historical_df = historical_df[
            ((historical_df['Year'] < row['Year']) |
             ((historical_df['Year'] == row['Year']) & (historical_df['Week'] < row['Week']))) &
            (historical_df['Company'] == row['Company'])
        ]
        # Most recent first, then take up to the last 16 weeks.
        filtered_historical_df = filtered_historical_df.sort_values(by=['Year', 'Week'], ascending=False)
        recent_weeks = filtered_historical_df.head(16)
        # CV is undefined with fewer than 2 rows or a zero mean.
        if recent_weeks.shape[0] > 1 and recent_weeks['Volume'].mean() != 0:
            cv = recent_weeks['Volume'].std() / recent_weeks['Volume'].mean()
        else:
            cv = None
        variability_calcs.append(cv)
    input_df['Variability'] = variability_calcs
    return input_df
```
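The function mutates and returns input_df, so it is called like:

```python
input_df = add_variability(input_df, historical_df)
```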
Output:

input_df =

| Year | Week | Company | Variability |
|------|------|---------|-------------|
| 2024 | 4    | A       | 0.334       |
| 2023 | 18   | A       | 0.242       |
| 2023 | 10   | B       | 0.455       |
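For what it's worth, here is a sketch of the kind of vectorized approach I am after, based on a per-company rolling window plus pd.merge_asof. It assumes Week values stay below 100 (so Year * 100 + Week sorts chronologically) and at most one historical row per Company/Year/Week; I have not verified it against the loop on every edge case:

```python
import pandas as pd

def add_variability_vectorized(input_df, historical_df):
    # Assumption: Week < 100, so Year * 100 + Week is a sortable period key.
    hist = historical_df.copy()
    hist['period'] = hist['Year'] * 100 + hist['Week']
    hist = hist.sort_values('period')

    # Rolling CV over the (up to) 16 rows ending at each historical row,
    # computed per company; NaN when fewer than 2 rows are in the window
    # or when the window mean is zero.
    vol = hist.groupby('Company')['Volume']
    mean = vol.transform(lambda s: s.rolling(16, min_periods=2).mean())
    std = vol.transform(lambda s: s.rolling(16, min_periods=2).std())
    hist['Variability'] = (std / mean).where(mean != 0)

    # For each input row, take the CV attached to the latest strictly
    # earlier historical row for the same company.
    out = input_df.copy()
    out['period'] = out['Year'] * 100 + out['Week']
    out['_order'] = range(len(out))  # remember the original row order
    out = out.sort_values('period')  # merge_asof needs sorted keys
    merged = pd.merge_asof(
        out,
        hist[['Company', 'period', 'Variability']],
        on='period',
        by='Company',
        allow_exact_matches=False,   # strictly before, not the same week
    )
    return (merged.sort_values('_order')
                  .drop(columns=['period', '_order'])
                  .reset_index(drop=True))
```

Because allow_exact_matches=False picks the most recent row strictly before each input period, the rolling window ending at that row should cover the same (up to) 16 weeks the loop selects. Rows with no usable history come back as NaN rather than None, which is what the loop's None becomes in a float column anyway. Is this roughly the right direction, or is there something cleaner?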