I've run into puzzling behavior with pandas groupby objects. I have a dataset with ids, features, and a target for training a machine learning model. In some cases there are groups of rows with identical features but differing target values, and since that doesn't make sense, I'd like to replace the target with the mean of the target values within each such group.
id_cols = list(df.columns[:4])
features = list(df.columns[4:-1])
target = df.columns[-1]
ad_id = id_cols[1]
creative_id = id_cols[-1]
Unfortunately, as I add more feature names (there are around 200) to the groupby operation, the aggregated dataset of means changes shape. As I understand it, the result of the aggregation should have exactly one row per unique group. But past a threshold number of features, the number of rows in the aggregated result collapses to small numbers:
for n in [10, 20, 30, 35, 40, 50, 100, 200]:
    grpby = df.groupby(features[:n])
    mean_targets = grpby[target].agg(["mean"])
    print(n, len(grpby), mean_targets.shape)
# 10 1349 (1349, 1)
# 20 1882 (1882, 1)
# 30 1978 (1978, 1)
# 35 1978 (31, 1)
# 40 1978 (31, 1)
# 50 1978 (31, 1)
# 100 1978 (19, 1)
# 200 4870 (2, 1)
As you can see, once I group by 35 or more features, the shape of mean_targets no longer matches the number of groups reported by len(grpby).
What could I be doing wrong, or could this be a pandas groupby limitation?
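One possibility I considered is missing values in the feature columns. A minimal synthetic check (the column names f1, f2, y are made up here, not from my actual data) shows that groupby with the default dropna=True silently excludes any row that has NaN in one of the grouping columns, which would shrink the aggregated result; I'm not certain this is what's happening in my dataset, but the symptom looks similar:

```python
import numpy as np
import pandas as pd

# Synthetic frame: two feature columns used as group keys, one target.
# Two of the four rows contain NaN in one of the key columns.
df = pd.DataFrame(
    {
        "f1": [1.0, 1.0, np.nan, 2.0],
        "f2": [1.0, 1.0, 1.0, np.nan],
        "y": [1.0, 2.0, 3.0, 4.0],
    }
)

# Default behavior (dropna=True): any row with NaN in a grouping
# column is silently excluded from the aggregation.
means_default = df.groupby(["f1", "f2"])["y"].agg(["mean"])
print(means_default.shape)  # (1, 1) -- only the (1.0, 1.0) group survives

# dropna=False (pandas >= 1.1) keeps NaN as a valid key value,
# so all three distinct key combinations appear in the result.
means_keep_nan = df.groupby(["f1", "f2"], dropna=False)["y"].agg(["mean"])
print(means_keep_nan.shape)  # (3, 1)
```

With 35+ sparse feature columns, nearly every row could contain at least one NaN among the keys, which would explain an aggregated result with only a handful of rows.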