I find duplicate rows in a DataFrame using the DataFrame.duplicated() method. For each set of duplicate rows, I would like to keep the first row, add up the values of a column across the duplicates, and then reset the index of the result.
The example code is:
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df is:
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
After applying the pandas functions, I would like the output to be:
     brand style  rating
0  Yum Yum   cup     8.0
1  Indomie   cup     3.5
2  Indomie  pack    20.0
The first two rows are duplicates (same ‘brand’ and ‘style’), so the first one is kept and the values in the ‘rating’ column are summed. The same applies to the 4th and 5th rows.
After keeping the first of each set of duplicate rows and adding the corresponding rating values, the index needs to be reset for the three output rows in the data frame.
How can I achieve this? I have tried several options, but because there are multiple pairs of duplicate rows, and I have no prior experience with DataFrame.groupby() and DataFrame.aggregate(), I haven’t been successful.
You can use groupby(...).sum() and then reset_index():
df = df.groupby(['brand', 'style']).sum().reset_index()
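Output (groupby sorts the groups by key by default, so the rows come out in a different order than in the question):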
     brand style  rating
0  Indomie   cup     3.5
1  Indomie  pack    20.0
2  Yum Yum   cup     8.0
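If the row order from the question matters, one possible variation is to pass sort=False so the groups stay in order of first appearance. This is only a sketch based on the example data above; the variable name result is mine, and as_index=False is used here in place of the separate reset_index() call.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# sort=False keeps the groups in order of first appearance instead of
# sorting them by key. as_index=False leaves 'brand' and 'style' as
# ordinary columns, so the result gets a fresh 0..n-1 index without a
# separate reset_index() call.
result = df.groupby(['brand', 'style'], sort=False, as_index=False)['rating'].sum()
print(result)
```

With the sample data, the first row of result should be the Yum Yum / cup group with a rating of 8.0, matching the order of the desired output in the question.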