I have the following Pandas groupby and apply code, which behaves differently in Pandas 1.3 vs Pandas 2.2.
import pandas as pd
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'], 'value': [1, 2, 3, 4]}).set_index(['group'])
print(df.groupby(level='group', group_keys=True).apply(lambda x: x))
print(df.groupby(level='group', group_keys=False).apply(lambda x: x))
So the input looks like this:
       value
group
A          1
A          2
B          3
B          4
In Pandas 1, both resulting DataFrames have a single index level called group:
       value
group
A          1
A          2
B          3
B          4
In Pandas 2, the first version returns a duplicate index level called group, while the second version gives a single index level called group:
             value
group group
A     A          1
      A          2
B     B          3
      B          4
and
       value
group
A          1
A          2
B          3
B          4
I’m not 100% clear on what happened. From reading the docs on group_keys, it seems like this parameter controls whether the group key (‘group’ in my case) gets added to the index. In that sense, the Pandas 2 behavior seems to make more sense: we get the result of the apply (which already has ‘group’ as its index), and this parameter decides whether to add the group key to the index again. That’s why setting it to True gives a duplicate index level called group.
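To check that reading on whichever Pandas version is installed, here is a small sketch that prints the index level names of the result for both settings. The helper name index_names is made up for this illustration, and I'm deliberately not listing expected output, since that is exactly what differs between versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"group": ["A", "A", "B", "B"], "value": [1, 2, 3, 4]}
).set_index(["group"])


def index_names(group_keys):
    # Hypothetical helper: run the identity apply and return the
    # index level names of the result for the given group_keys setting.
    result = df.groupby(level="group", group_keys=group_keys).apply(lambda x: x)
    return list(result.index.names)


print("pandas", pd.__version__)
print("group_keys=True: ", index_names(True))
print("group_keys=False:", index_names(False))
```

If the interpretation above is right, the group_keys=True line should show the extra level only on the newer version.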
It does seem a bit confusing that the behavior for group_keys=True is different between the 2 Pandas versions. It seems like Pandas 1 dropped the duplicate index level. And I don’t see this parameter mentioned in the breaking-change documentation.
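In the meantime, going by the outputs above, passing group_keys=False explicitly looks like one way to get the same single-level index on both versions. A minimal sketch, assuming that observation holds in general for functions that return each group with its original index (apply_keep_index is a made-up name for this post, not a Pandas function):

```python
import pandas as pd


def apply_keep_index(frame, level, func):
    # Hypothetical helper: always pass group_keys=False so the group key
    # is not prepended to the index of the applied result.
    return frame.groupby(level=level, group_keys=False).apply(func)


df = pd.DataFrame(
    {"group": ["A", "A", "B", "B"], "value": [1, 2, 3, 4]}
).set_index(["group"])

out = apply_keep_index(df, "group", lambda x: x)
print(out.index.nlevels)
print(out)
```

This only sidesteps the difference; it doesn't explain why group_keys=True changed.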
Does anyone have a better explanation of what happened?