I have two dataframes: one that acts as a lookup dataframe, and another that I insert values into wherever rows match between them. The lookup dataframe looks like this:
import pandas as pd

data = {'store': ['1', '1', '1', '1', '1', '1'], 'department': ['produce', 'produce', 'produce', 'bakery', 'meat', 'bakery'], 'task': ['water', 'stock', 'water', 'bread', 'stock', 'doughnuts'],
'employee': ['232', '111', '232', '121', '333', '121'], 'step': ['A', 'B', 'C', 'A', 'C', 'A'], 'month_year_date': ['1991-01', '1991-01', '1991-01', '1991-02', '1991-03', '1991-02'],
'work_time': ['1.2', '2.4', '1.1', '4.0', '1.5', '3.5']}
df_1 = pd.DataFrame.from_dict(data)
store | department | task | employee | step | month_year_date | work_time |
---|---|---|---|---|---|---|
1 | produce | water | 232 | A | 1991-01 | 1.2 |
1 | produce | stock | 111 | B | 1991-01 | 2.4 |
1 | produce | water | 232 | C | 1991-01 | 1.1 |
1 | bakery | bread | 121 | A | 1991-02 | 4.0 |
1 | meat | stock | 333 | C | 1991-03 | 1.5 |
1 | bakery | doughnuts | 121 | A | 1991-02 | 3.5 |
and the insert dataframe looks like this:
data2 = {'store': ['1', '1', '1', '1', '1', '1'], 'department': ['produce', 'produce', 'produce', 'bakery', 'meat', 'bakery'], 'task': ['water', 'stock', 'water', 'bread', 'stock', 'doughnuts'],
'employee': ['232', '111', '232', '121', '333', '121'], 'step': ['A', 'B', 'A', 'A', 'C', 'A'], 'month_year_date': ['1991-01', '1991-01', '1991-02', '1991-02', '1991-03', '1991-02']}
df_2 = pd.DataFrame.from_dict(data2)
df_2[['A', 'B', 'C']] = 0
store | department | task | employee | step | month_year_date | A | B | C |
---|---|---|---|---|---|---|---|---|
1 | produce | water | 232 | A | 1991-01 | 0 | 0 | 0 |
1 | produce | stock | 111 | B | 1991-01 | 0 | 0 | 0 |
1 | produce | water | 232 | A | 1991-02 | 0 | 0 | 0 |
1 | bakery | bread | 121 | A | 1991-02 | 0 | 0 | 0 |
1 | meat | stock | 333 | C | 1991-03 | 0 | 0 | 0 |
1 | bakery | doughnuts | 121 | A | 1991-02 | 0 | 0 | 0 |
My goal is to create a dataframe (df_2) where each row is a unique combination of store, department, task, employee, step, and month_year_date, with the values in columns A, B, and C holding the corresponding work_time from df_1. I'm currently using something like this:
import numpy as np

for i in range(len(df_2)):
    # Find every row of df_1 matching row i of df_2 on the key columns.
    indx = np.where((df_2.department[i] == df_1.department) & (df_2.task[i] == df_1.task) &
                    (df_2.employee[i] == df_1.employee) & (df_2.month_year_date[i] == df_1.month_year_date))
    for n in range(len(indx[0])):
        # Write work_time into the column named after that row's step.
        df_2.loc[i, df_1.step[indx[0][n]]] = df_1.loc[indx[0][n]].work_time
store | department | task | employee | step | month_year_date | A | B | C |
---|---|---|---|---|---|---|---|---|
1 | produce | water | 232 | A | 1991-01 | 1.2 | 0 | 1.1 |
1 | produce | stock | 111 | B | 1991-01 | 0 | 2.4 | 0 |
1 | produce | water | 232 | A | 1991-02 | 0 | 0 | 0 |
1 | bakery | bread | 121 | A | 1991-02 | 4.0 | 0 | 0 |
1 | meat | stock | 333 | C | 1991-03 | 0 | 0 | 1.5 |
1 | bakery | doughnuts | 121 | A | 1991-02 | 3.5 | 0 | 0 |
which yields what I'm looking for, but it takes a significant amount of time on larger dataframes (2+ million rows).
My question: is there a more efficient way to accomplish this? I know that looping over df_2 and scanning df_1 for each row gives this step quadratic time complexity, and I'd love to speed it up!
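For what it's worth, one direction I've been wondering about is pivoting df_1 wide on step and then left-merging onto df_2's key columns. Here is a sketch of what I mean (the choice of key columns and the fillna handling are my own guesses, and I haven't benchmarked this at 2M rows):

```python
import pandas as pd

# Rebuild the two frames from above.
data = {'store': ['1', '1', '1', '1', '1', '1'],
        'department': ['produce', 'produce', 'produce', 'bakery', 'meat', 'bakery'],
        'task': ['water', 'stock', 'water', 'bread', 'stock', 'doughnuts'],
        'employee': ['232', '111', '232', '121', '333', '121'],
        'step': ['A', 'B', 'C', 'A', 'C', 'A'],
        'month_year_date': ['1991-01', '1991-01', '1991-01', '1991-02', '1991-03', '1991-02'],
        'work_time': ['1.2', '2.4', '1.1', '4.0', '1.5', '3.5']}
df_1 = pd.DataFrame(data)

data2 = {'store': ['1', '1', '1', '1', '1', '1'],
         'department': ['produce', 'produce', 'produce', 'bakery', 'meat', 'bakery'],
         'task': ['water', 'stock', 'water', 'bread', 'stock', 'doughnuts'],
         'employee': ['232', '111', '232', '121', '333', '121'],
         'step': ['A', 'B', 'A', 'A', 'C', 'A'],
         'month_year_date': ['1991-01', '1991-01', '1991-02', '1991-02', '1991-03', '1991-02']}
df_2 = pd.DataFrame(data2)

keys = ['store', 'department', 'task', 'employee', 'month_year_date']

# Pivot df_1 wide: one column per step value, cells hold work_time.
wide = (df_1.assign(work_time=df_1['work_time'].astype(float))
            .pivot_table(index=keys, columns='step', values='work_time', aggfunc='sum')
            .reindex(columns=['A', 'B', 'C'])
            .reset_index())

# Left-merge onto df_2; rows of df_2 with no match in df_1 get 0.
result = df_2.merge(wide, on=keys, how='left')
result[['A', 'B', 'C']] = result[['A', 'B', 'C']].fillna(0)
```

On this small sample it reproduces the output table above, but I'm not sure whether it's the idiomatic way to do this or whether it would actually be faster on the full data.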