I’m curious about a weird behavior I got while using Pandas.
My initial goal was, for each group in my data, to replace all values in a column with NA when that column contains more than x% missing values, and to keep the original values (NA included) otherwise.
To do so, I tried using groupby and transform with a lambda containing a conditional statement on x.isna().mean(). It works for the most part but gives weird results when some specific conditions are met.
Here’s a reproducible example with a toy dataframe; the threshold is set to 60% missing values:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "A": [np.nan, 4.7, 6.6, np.nan, np.nan, 5.4, 6., 5.3],
        "B": [np.nan, np.nan, 7.2, 15., np.nan, 5.5, 3.1, np.nan],
        "C": ["D", "D", "D", "E", "E", "F", "F", "F"]
    }
)
df.groupby("C").transform(lambda x : x if x.isna().mean() < .6 else np.nan)
The initial data:
A B C
0 NaN NaN D
1 4.7 NaN D
2 6.6 7.2 D
3 NaN 15.0 E
4 NaN NaN E
5 5.4 5.5 F
6 6.0 3.1 F
7 5.3 NaN F
What I expect:
A B
0 NaN NaN
1 4.7 NaN
2 6.6 NaN
3 NaN 15.0
4 NaN NaN
5 5.4 5.5
6 6.0 3.1
7 5.3 NaN
What I get:
A B
0 NaN NaN
1 4.7 NaN
2 6.6 NaN
3 NaN 3 15.0 4 NaN Name: B, dtype: float64
4 NaN 3 15.0 4 NaN Name: B, dtype: float64
5 5.4 5.5
6 6.0 3.1
7 5.3 NaN
My problem lies in rows 3 and 4 of column B, where each cell holds the group's entire Series instead of a scalar value.
After a few tests, it seems to happen when two conditions are met:
- the grouped values in the first column are set to NaN;
- the values in the second column are supposed to stay the same.
If I switch to x.notna() in my condition, the same problem happens when column A contains only valid values, and the error always appears in the following columns.
I understand there are other ways to get the wanted result in Pandas, so feel free to make suggestions, but I’d really like to understand what is happening here: is my code faulty in some way, can it be easily corrected, or is it some kind of bug?
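For reference, here is a minimal sketch of one such alternative. It does not explain the behavior above. It only sidesteps the lambda by computing the share of missing values per group first and then masking with where. The names values, missing_share, threshold and result are made up for this illustration, and the toy data uses the values shown in the initial data table (3.1 in row 6 of column B).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [np.nan, 4.7, 6.6, np.nan, np.nan, 5.4, 6.0, 5.3],
        "B": [np.nan, np.nan, 7.2, 15.0, np.nan, 5.5, 3.1, np.nan],
        "C": ["D", "D", "D", "E", "E", "F", "F", "F"],
    }
)

threshold = 0.6  # same 60% cut-off as in the example above

# Work on the value columns only; "C" is used purely as the grouping key.
values = df.drop(columns="C")

# Share of missing values per group and per column, broadcast back to
# the original row index (same shape as `values`).
missing_share = values.isna().astype(float).groupby(df["C"]).transform("mean")

# Keep original values where the share is below the threshold;
# everything else becomes NaN.
result = values.where(missing_share < threshold)
print(result)
```

Because the condition is evaluated on a frame of floats rather than inside a function that sometimes returns a Series and sometimes a scalar, every cell of the result stays a scalar.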
Thanks for your help and sorry if my English is a bit clunky 🙂