I have a MultiIndex pandas DataFrame and I’m using groupby to extract the row with the first appearance of each value in the first index level.
After this operation, however, the output column values do not always correspond to the original values.
Here is a simple example to reproduce this behaviour:
import pandas as pd

df = pd.DataFrame([{'myIndex1': 'A', 'myIndex2': 0, 'C1': 1.0, 'C2': None},
                   {'myIndex1': 'A', 'myIndex2': 1, 'C1': 0.5, 'C2': 'ca'},
                   {'myIndex1': 'B', 'myIndex2': 0, 'C1': 3.0, 'C2': 'cb'},
                   {'myIndex1': 'C', 'myIndex2': 0, 'C1': 2.0, 'C2': 'cc'}])
df.set_index(['myIndex1', 'myIndex2'], inplace=True)
df
Now if I use groupby to extract the first appearances of myIndex1:
df.groupby(level='myIndex1').first()
So the column ‘C2’ for the first appearance of myIndex1 = A is no longer None, but it has been changed to ‘ca’.
I’ve checked that this happens if the column value is None or NaN.
Of course, I can replace these values, but I would like to avoid that.
Any thoughts about what could be causing this behavior and how I can avoid it?
Thanks!
By default, groupby.first skips null values (NaN and None) and returns the first non-null entry of each column. To keep them, pass skipna=False:
df.groupby(level='myIndex1').first(skipna=False)
Note that you could also use groupby.nth:
df.groupby(level='myIndex1').nth(0)
Or groupby.head:
df.groupby(level='myIndex1').head(1)
Output of first(skipna=False):
           C1    C2
myIndex1
A         1.0  None
B         3.0    cb
C         2.0    cc
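One difference worth knowing: the output above is indexed by myIndex1 alone, whereas head(1) acts as a row filter and keeps both index levels. The sketch below, using the same data as the question, shows one way to get the same shape from head(1) by dropping the second level afterwards. The variable names are only illustrative.

```python
import pandas as pd

# Same data as in the question, built directly with a MultiIndex
index = pd.MultiIndex.from_tuples(
    [("A", 0), ("A", 1), ("B", 0), ("C", 0)],
    names=["myIndex1", "myIndex2"],
)
df = pd.DataFrame(
    {"C1": [1.0, 0.5, 3.0, 2.0], "C2": [None, "ca", "cb", "cc"]},
    index=index,
)

# head(1) keeps the first row of every group untouched, nulls included,
# and preserves the original (myIndex1, myIndex2) index.
first_rows = df.groupby(level="myIndex1").head(1)

# Drop the second level so the result is indexed by myIndex1 only.
by_first_level = first_rows.droplevel("myIndex2")

print(first_rows)
print(by_first_level)
```

This variant does not rely on the skipna parameter, so it may be a reasonable fallback on older pandas installations.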