I would like to impute some missing values in a pandas data frame using .bfill() or .ffill(). I want to use .groupby() first. The problem is that .bfill() and .ffill() depend on previous or next rows and I am using .groupby() on multiple columns.
Here is an MWE with what I tried:
import pandas as pd

data = [['Joe Biden', 'a', '2021-10-31', 43],
        ['Donald Trump', 'lv', '2021-10-31', 42],
        ['Joe Biden', 'lv', '2021-10-31', 41],
        ['Joe Biden', 'lv', '2021-11-01', None],  # the missing value to impute
        ['Joe Biden', 'rv', '2021-11-01', 42.5]]
df = pd.DataFrame(data, columns=['candidate', 'pop', 'date', 'avg'])
df['avg'] = df.groupby(['candidate', 'pop', 'date'])['avg'].transform(lambda x: x.bfill())
What I want to happen is that the NaN becomes a 41, the ‘avg’ value for ‘Joe Biden’, ‘lv’, on the previous date, ‘2021-10-31’, within the group.
But it’s hard to tell what to expect, because .bfill() and .ffill() fill from a neighboring row within the group (the next row for .bfill(), the previous row for .ffill()). Since I have grouped by multiple columns, is that the neighboring ‘candidate’, the neighboring ‘pop’, or the neighboring ‘date’? I want it to be the previous ‘date’.
The result of the above code is that nothing gets imputed. Is there a (quicker, if possible) way to do what I’m trying to do?
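For reference, a minimal sketch of one possible approach, not a definitive fix. In the MWE every (candidate, pop, date) combination is unique, so each group contains a single row and there is nothing to fill from. The sketch assumes the intent is to carry the last known value forward in time within each (candidate, pop) pair, so it groups by those two columns only, sorts by 'date', and uses .ffill() (which takes the previous row) rather than .bfill() (which takes the next one). Converting 'date' with pd.to_datetime is an extra assumption, made so the sort is chronological.

```python
import pandas as pd

data = [['Joe Biden', 'a', '2021-10-31', 43],
        ['Donald Trump', 'lv', '2021-10-31', 42],
        ['Joe Biden', 'lv', '2021-10-31', 41],
        ['Joe Biden', 'lv', '2021-11-01', None],
        ['Joe Biden', 'rv', '2021-11-01', 42.5]]
df = pd.DataFrame(data, columns=['candidate', 'pop', 'date', 'avg'])

# Assumption: parse dates so that sorting is chronological, not lexical.
df['date'] = pd.to_datetime(df['date'])

# Sort so that "previous row within the group" means "previous date".
df = df.sort_values(['candidate', 'pop', 'date'])

# Group WITHOUT 'date'; otherwise every group has exactly one row.
# The result keeps the original index, so the assignment aligns row by row.
df['avg'] = df.groupby(['candidate', 'pop'])['avg'].ffill()

print(df)
```

Calling .ffill() directly on the groupby object avoids the Python-level lambda in .transform(), which is usually faster on large frames.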
(If I can get this to work then I also want to do something similar with a rolling average of ‘avg’ within those same groups.)
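A rolling average within the same groups might be sketched along the same lines. The window of 2 rows and min_periods=1 are illustrative assumptions, as is the column name 'rolling_avg'; none of them come from the question.

```python
import pandas as pd

data = [['Joe Biden', 'a', '2021-10-31', 43],
        ['Donald Trump', 'lv', '2021-10-31', 42],
        ['Joe Biden', 'lv', '2021-10-31', 41],
        ['Joe Biden', 'lv', '2021-11-01', None],
        ['Joe Biden', 'rv', '2021-11-01', 42.5]]
df = pd.DataFrame(data, columns=['candidate', 'pop', 'date', 'avg'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['candidate', 'pop', 'date'])

groups = ['candidate', 'pop']

# Fill gaps first so the rolling window sees no NaN.
df['avg'] = df.groupby(groups)['avg'].ffill()

# Hypothetical window: mean over the current and previous row of each group.
# transform() returns a Series aligned with df's index.
df['rolling_avg'] = (
    df.groupby(groups)['avg']
      .transform(lambda s: s.rolling(window=2, min_periods=1).mean())
)

print(df)
```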