I am trying to aggregate summer sales data for each customer: summing "sales" grouped by customer, but keeping only rows with date > 2020-06-01.
<code>import pandas as pd

df = pd.DataFrame({
    "customer": [1, 1, 2, 2, 3, 3],
    "sales": [56, 69, 32, 2, 45, 90],
    "date": ['2020-09-01', '2020-08-01', '2020-07-01',
             '2020-08-01', '2020-02-01', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])

# this works successfully
new_df = df.groupby("customer")["sales"].sum().reset_index()

# this gives me an error
new_df_summer = df.groupby("customer").filter(
    lambda x: x["date"] > "2020-06-01"
).groupby("customer")["sales"].sum().reset_index()
</code>
I get this error: TypeError: filter function returned a Series, but expected a scalar bool
You shouldn't use groupby.filter here: it filters whole groups, not individual rows. To filter rows, use loc instead:
<code>new_df_summer = (
    df.loc[lambda x: x['date'] > '2020-06-01']
    .groupby('customer')['sales']
    .sum()
    .reset_index()
)
</code>
Output:
<code> customer sales
0 1 125
1 2 34
</code>
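Since the row condition here doesn't depend on the calling DataFrame changing, a plain boolean mask works just as well as the .loc/lambda form (the lambda form is mainly useful mid-chain when the intermediate frame has no name). A sketch on the same sample data:

```python
import pandas as pd

# Same sample frame as in the question
df = pd.DataFrame({
    "customer": [1, 1, 2, 2, 3, 3],
    "sales": [56, 69, 32, 2, 45, 90],
    "date": ['2020-09-01', '2020-08-01', '2020-07-01',
             '2020-08-01', '2020-02-01', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])

# Filter rows first with a boolean mask, then aggregate
new_df_summer = (
    df[df['date'] > '2020-06-01']
    .groupby('customer')['sales']
    .sum()
    .reset_index()
)
```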
If you wanted to keep the groups that have at least one matching date (but still aggregate all dates!), you could have used:
<code>df.groupby('customer').filter(
    lambda x: x['date'].gt('2020-06-01').any()
).groupby('customer')['sales'].sum().reset_index()
</code>
Output:
<code> customer sales
0 1 125
1 2 34
</code>
See how group 3 is gone, and you have the total for the remaining groups.
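For what it's worth, the same group-keeping logic can be written without groupby.filter by broadcasting the per-group boolean back to the rows with transform, then indexing. A sketch of that alternative:

```python
import pandas as pd

df = pd.DataFrame({
    "customer": [1, 1, 2, 2, 3, 3],
    "sales": [56, 69, 32, 2, 45, 90],
    "date": ['2020-09-01', '2020-08-01', '2020-07-01',
             '2020-08-01', '2020-02-01', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])

# transform broadcasts the scalar .any() result to every row of the group,
# so mask is True for all rows of groups with at least one date after the cutoff
mask = df.groupby('customer')['date'].transform(
    lambda s: s.gt('2020-06-01').any()
)
out = df[mask].groupby('customer')['sales'].sum().reset_index()
```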