I want to get the information per Name.
df=pd.DataFrame({'Name':['A','A','A','A','B','B','B','B'],
'Pair':['b','c','d','e','b','f','d','g'],
'List':['1,2','2,3,4',np.nan,'5',np.nan,'1,2,3','3,6','5,4']})
My output, so basically, I want to split ‘Pair’ and ‘List’ into digits and join them.
-
‘Unique Pair’: group by each name and split it and drop NA and join all unique letters and sort letters.
-
‘Unique Item’: same logic as of ‘Unique Pair’
-
‘All Item’: the only difference is that it does not join unique letters but all letters appeared and sort letters. For example, there are two ‘2’ for name A, so it appears twice.
out=pd.DataFrame({'Name':['A','B'],
'Unique Pair':['b,c,d,e','b,d,f,g'],
'Unique Item':['1,2,3,4,5','1,2,3,4,5,6'],
'All Item':['1,2,2,3,4,5','1,2,3,3,4,5,6']})
I have tried this but I don’t know how to append all items together.
df.groupby('Name')['List'].fillna('NA').apply(lambda x: x.split(','))
7
Here’s one approach with df.groupby
:
out = df.groupby('Name', as_index=False).agg(
**{
'Unique Pair': ('Pair', lambda x:
','.join(np.unique(x))
),
'Unique Item': ('List', lambda x: ','.join(
np.unique(x.dropna().str.split(',').explode()))
),
'All Item': ('List', lambda x: ','.join(
x.dropna().str.split(',').explode().sort_values())
)
})
Output:
Name Unique Pair Unique Item All Item
0 A b,c,d,e 1,2,3,4,5 1,2,2,3,4,5
1 B b,d,f,g 1,2,3,4,5,6 1,2,3,3,4,5,6
Explanation
Use groupby.agg
:
- for
'Unique Pair'
usestr.join
after applyingnp.unique
, which returns sorted unique elements. - for
'Unique Item'
dropNaN
values withSeries.dropna
, applySeries.str.split
andSeries.explode
. Pass result tonp.unique
and usestr.join
. - for
'All Item'
, do the same, but afterexplode
, chainSeries.sort_values
and then usestr.join
.
On the use of named aggregations, see here.
To avoid repetition of x.dropna().str.split(',').explode()
(which is expensive) for both 'Unique Item'
and 'All Item'
, you can also adjust your df
before starting the groupby
aggregations. E.g., do something like this:
df = (df.assign(List=df['List'].dropna().str.split(','))
.explode('List', ignore_index=True)
)
out2 = df.groupby('Name', as_index=False).agg(
**{
'Unique Pair': ('Pair', lambda x: ','.join(np.unique(x))),
'Unique Item': ('List', lambda x: ','.join(np.unique(x.dropna()))),
'All Item': ('List', lambda x: ','.join(x.dropna().sort_values()))
})
Output:
out2.equals(out)
# True
If performance is important, and you are working with a large dataset, this will probably be faster.