Main problem
I’m trying to find a way to select the rows that make up the largest sub-group inside another group in a Pandas DataFrame and I’m having a bit of a hard time.
Visual example (code below)
Here is a sample dataset to help explain exactly what it is I’m trying to do. There is code below to recreate this dataset yourself, if you’d like.
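Printed out (via the reproduction code below), the dataset looks like this:

    RowID     Col1    Col2  Col3     Col4
 0  10001  Group A  Type 1   100    Alice
 1  10002  Group A  Type 1   200      Bob
 2  10003  Group A  Type 2   300     Carl
 3  10004  Group A  Type 2   400     Dave
 4  10005  Group A  Type 3   500     Earl
 5  10006  Group A  Type 3   600     Fred
 6  10007  Group A  Type 3   700     Greg
 7  10008  Group A  Type 3   800    Henry
 8  10009  Group A  Type 3   900     Iris
 9  10010  Group A  Type 3  1000  Jasmine
10  10011  Group B  Type 6  2000     Kris
11  10012  Group B  Type 6  1900   Lonnie
12  10013  Group B  Type 6  1800    Manny
13  10014  Group B  Type 6  1700  Norbert
14  10015  Group B  Type 6  1600     Otis
15  10016  Group B  Type 3  1500    Pearl
16  10017  Group B  Type 3  1400    Quaid
17  10018  Group B  Type 2  1300    Randy
18  10019  Group B  Type 2  1200    Steve
19  10020  Group B  Type 2  1100     Tana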
Suppose I want to group this table by Col1 and figure out which unique value of Col2 has the most rows (within each group of Col1). Further, I don't want to just know which group is the largest – I want to find a way to select the rows from the original DataFrame that match that description.
So, in this case, we can easily see that, for Col1=="Group A", the value of Col2 with the most rows is "Type 3", and for Col1=="Group B", the value of Col2 with the most rows is "Type 6". That means that I want to select the rows with RowID in [10005, 10006, 10007, 10008, 10009, 10010, 10011, 10012, 10013, 10014, 10015].
Therefore, the output I’m looking for would be the following:
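    RowID     Col1    Col2  Col3     Col4
 0  10005  Group A  Type 3   500     Earl
 1  10006  Group A  Type 3   600     Fred
 2  10007  Group A  Type 3   700     Greg
 3  10008  Group A  Type 3   800    Henry
 4  10009  Group A  Type 3   900     Iris
 5  10010  Group A  Type 3  1000  Jasmine
 6  10011  Group B  Type 6  2000     Kris
 7  10012  Group B  Type 6  1900   Lonnie
 8  10013  Group B  Type 6  1800    Manny
 9  10014  Group B  Type 6  1700  Norbert
10  10015  Group B  Type 6  1600     Otis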
My clumsy attempt
I found a solution, but it is tremendously convoluted. Here is a step-by-step explanation of what I did:
Step 1: First, for each group of Col1, I want to tally up the number of rows that exist for each unique value of Col2. That's quite easy: I can just do a simple groupby(['Col1','Col2']) and see the size of each grouping.
Here is how that looks:
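      Col1    Col2  RowID
0  Group A  Type 1      2
1  Group A  Type 2      2
2  Group A  Type 3      6
3  Group B  Type 2      3
4  Group B  Type 3      2
5  Group B  Type 6      5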
Notice that for Col1=="Group A", Col2=="Type 1" has 2 observations, Col2=="Type 2" has 2 observations, and Col2=="Type 3" has 6 observations – as expected from our original data.
Step 2: This is trickier: for each group of Col1, I want to find the value of Col2 that has the biggest number of rows from Step 1.
Here is how that looks:
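      Col1  RowID
0  Group A      6
1  Group B      5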
Notice that we only see the cases with “max rows”.
Step 3: Lastly, I want to filter the original data to show ONLY the rows that fit that specific grouping: the ones found in Step 2.
Reproducible example and code
Here’s some code to illustrate my example:
# Importing the relevant library
import pandas as pd

# Creating my small reproducible example
my_df = pd.DataFrame({
    'RowID': list(range(10001, 10021)),
    'Col1': ['Group A'] * 10 + ['Group B'] * 10,
    'Col2': ['Type 1'] * 2 + ['Type 2'] * 2 + ['Type 3'] * 6
            + ['Type 6'] * 5 + ['Type 3'] * 2 + ['Type 2'] * 3,
    'Col3': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
             2000, 1900, 1800, 1700, 1600, 1500, 1400, 1300, 1200, 1100],
    'Col4': ['Alice', 'Bob', 'Carl', 'Dave', 'Earl', 'Fred', 'Greg', 'Henry',
             'Iris', 'Jasmine', 'Kris', 'Lonnie', 'Manny', 'Norbert', 'Otis',
             'Pearl', 'Quaid', 'Randy', 'Steve', 'Tana']})
# Solving Step 1: finding the unique groupings and their relative sizes
temp1 = my_df.groupby(['Col1','Col2']).agg({'RowID':'count'}).reset_index()
# Solving Step 2: finding which grouping is the largest
temp2 = temp1.groupby(['Col1']).agg({'RowID':'max'}).reset_index()
# Solving Step 3: finding which rows of the original DataFrame match what was
# found in Step 2
# Step 3 Part 1: Finding the actual combination of `Col1` & `Col2` that led to
# the largest number of rows
temp3 = (temp1
         .rename(columns={'RowID': 'RowID_count'})
         .merge(temp2.rename(columns={'RowID': 'RowID_max'}),
                how='left',
                on='Col1')
         .assign(RowID_ismax=lambda _df: _df['RowID_count'] == _df['RowID_max'])
         .query('RowID_ismax')
         .drop(columns=['RowID_count', 'RowID_max']))
# Step 3 Part 2: Finding the matching rows in the original dataset and
# filtering it down
result = (my_df
          .merge(temp3,
                 how='left',
                 on=['Col1', 'Col2'])
          .assign(RowID_ismax=lambda _df: _df['RowID_ismax'].fillna(False))
          .query('RowID_ismax')
          .reset_index(drop=True)
          .drop(columns=['RowID_ismax']))
The solution above is EXTREMELY convoluted, full of assign and lambda statements alongside several sequential groupby and reset_index calls, which suggests to me that I'm approaching this the wrong way.
Any help on this would be greatly appreciated.
Answer

A short way to perform this could be value_counts + idxmax, then merge:
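# Count each (Col1, Col2) pair; within each Col1 group, take the index
# label (a (Col1, Col2) tuple) with the largest count, then inner-merge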
keep = my_df[['Col1', 'Col2']].value_counts().groupby(level='Col1').idxmax()
out = my_df.merge(pd.DataFrame(keep.tolist(), columns=['Col1', 'Col2']))
Output:
    RowID     Col1    Col2  Col3     Col4
 0  10005  Group A  Type 3   500     Earl
 1  10006  Group A  Type 3   600     Fred
 2  10007  Group A  Type 3   700     Greg
 3  10008  Group A  Type 3   800    Henry
 4  10009  Group A  Type 3   900     Iris
 5  10010  Group A  Type 3  1000  Jasmine
 6  10011  Group B  Type 6  2000     Kris
 7  10012  Group B  Type 6  1900   Lonnie
 8  10013  Group B  Type 6  1800    Manny
 9  10014  Group B  Type 6  1700  Norbert
10  10015  Group B  Type 6  1600     Otis
Intermediates:
# my_df[['Col1', 'Col2']].value_counts()
Col1     Col2
Group A  Type 3    6
Group B  Type 6    5
         Type 2    3
Group A  Type 1    2
         Type 2    2
Group B  Type 3    2
Name: count, dtype: int64

# my_df[['Col1', 'Col2']].value_counts().groupby(level='Col1').idxmax()
Col1
Group A    (Group A, Type 3)
Group B    (Group B, Type 6)
Name: count, dtype: object
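As an aside, here is a minimal sketch of the same idea without the merge (an alternative, not code from the original post), using groupby.transform to broadcast each Col1 group's most frequent Col2 value back onto the rows:

# For each Col1 group, compute the most frequent Col2 value and broadcast
# it to every row of that group, then keep the rows whose Col2 matches it
most_common = my_df.groupby('Col1')['Col2'].transform(lambda s: s.value_counts().idxmax())
out = my_df[my_df['Col2'] == most_common].reset_index(drop=True)

On ties, value_counts().idxmax() keeps a single winner per group, just like the idxmax approach above.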