Imagine a process, where we do several pandas groupbys.
We start with a df like so:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame({
'id': np.arange(10),
'a': np.random.randint(1, 10, 10),
'b': np.random.randint(1, 10, 10),
'c': np.random.randint(1, 10, 10)
})
df
Out[23]:
id a b c
0 0 2 8 2
1 1 8 8 9
2 2 7 2 9
3 3 3 8 4
4 4 5 1 9
5 5 6 7 8
6 6 3 8 4
7 7 5 7 7
8 8 3 2 6
9 9 5 1 2
We perform a groupby ‘a’ (The ‘max’ agg is just an example)
new_df = df.groupby('a').agg('max').reset_index()
Out[25]:
a id b c
0 2 0 8 2
1 3 8 8 6
2 5 9 7 9
3 6 5 7 8
4 7 2 2 9
5 8 1 8 9
And I want to keep track of the original id to which group it belongs.
For example,
id 0 belongs to a = 2,
1 to 8,
2 to 7,
(3, 6, 8) belongs to 3
etc..
Afterword we perform another groupby:
new_df.groupby('b').agg('max').reset_index()
Out[28]:
b a id c
0 2 7 2 9
1 7 6 9 9
2 8 8 8 9
Now we have a continued mapping,
Where
group a (2, 3, 8) belongs to group 8 (of b)
(5, 6) = 7
7 = 2
And this result in a long map where the original id:
0 => a = 2 => b = 8 (where b = 8 is the final group that interests me)
1 => a = 8 => b = 8
2 => a = 7 => b = 2
And so on..
Now I do this in order to reduce a lot of entities in my data so that I can group them in the same bucket somehow. And I need to map them from their original id to a new id, that is being represented after many iterations of groupby.
In the end I want to see something like so
Out[32]:
id grp
0 0 8
1 1 8
2 2 2
3 3 8
4 4 7
5 5 7
6 6 8
7 7 7
8 8 8
9 9 7
again, because id=0 went to a=2 and a=2 went to b=8..
Any solution or suggestion will be most welcome. Even carrying the values in a column dedicated to this, with each group by.
And when we aggregate, we can do a set addition…