Sample data:
import pandas as pd
data = [[1, '[email protected]'], [2, '[email protected]'], [3, '[email protected]']]
person = pd.DataFrame(data, columns=['id', 'email']).astype({'id':'int64', 'email':'object'})
Reproducible code:
(person.sort_values(by = ['email', 'id'], ascending = [True, True])
.drop_duplicates(subset = 'email', keep = 'first', inplace = True))
I expected the code above to modify person so that it looks like this:
id email
1 2 [email protected]
0 1 [email protected]
But instead, person still looks like its original form:
id email
0 1 [email protected]
1 2 [email protected]
2 3 [email protected]
If I break up the methods into two parts, then it works
person1 = person.sort_values(by = ['email', 'id'], ascending = [True, True])
person1.drop_duplicates(subset = 'email', keep = 'first', inplace = True)
In this case, person1 has the desired form:
id email
1 2 [email protected]
0 1 [email protected]
Why doesn't the first snippet remove the duplicated emails in place?
Your first code is more or less equivalent to:
temporary = person.sort_values(by = ['email', 'id'], ascending = [True, True])
temporary.drop_duplicates(subset = 'email', keep = 'first', inplace = True)
del temporary
Thus, the temporary dataframe returned by sort_values is modified in place and then discarded, while person itself is never touched.
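To see this concretely, here is a minimal sketch. The e-mail addresses are hypothetical placeholders (not the original data), chosen so that ids 1 and 3 share an address and the address of id 2 sorts first. It checks that sort_values hands back a new object and that drop_duplicates with inplace=True returns None:

```python
import pandas as pd

# Hypothetical placeholder addresses; ids 1 and 3 share the same e-mail.
person = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "email": ["john@example.com", "bob@example.com", "john@example.com"],
    }
)

# sort_values returns a new DataFrame; person itself is left untouched.
temporary = person.sort_values(by=["email", "id"], ascending=[True, True])
is_same_object = temporary is person

# With inplace=True, drop_duplicates mutates `temporary` and returns None,
# so a chained expression ending in this call evaluates to None.
result = temporary.drop_duplicates(subset="email", keep="first", inplace=True)

print(is_same_object)   # False
print(result)           # None
print(len(temporary))   # 2
print(len(person))      # 3
```

In the chained version, the object bound to temporary here has no name at all, so the deduplicated frame is unreachable once the expression finishes.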
If you want to chain commands, do not use inplace=True; instead, assign the output of the chain to a variable:
out = (person.sort_values(by = ['email', 'id'], ascending = [True, True])
.drop_duplicates(subset = 'email', keep = 'first')
)
Output:
id email
1 2 [email protected]
0 1 [email protected]
Or use only in-place operations, one statement at a time:
person.sort_values(by = ['email', 'id'], ascending = [True, True], inplace=True)
person.drop_duplicates(subset = 'email', keep = 'first', inplace=True)
print(person)
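A third variant, sketched here as a suggestion rather than something the original code did, is to keep the chain and rebind the result to the same name. The e-mail addresses are again hypothetical placeholders, and the reset_index call is optional (it only renumbers the rows):

```python
import pandas as pd

# Hypothetical placeholder addresses; ids 1 and 3 share the same e-mail.
person = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "email": ["john@example.com", "bob@example.com", "john@example.com"],
    }
)

# No inplace=True anywhere: each method returns a new DataFrame,
# and the final result is bound back to the name `person`.
person = (
    person.sort_values(by=["email", "id"], ascending=[True, True])
    .drop_duplicates(subset="email", keep="first")
    .reset_index(drop=True)  # optional: renumber rows 0..n-1
)

print(person)
```

Note that this rebinds the name rather than mutating the original object, so any other variable still referring to the original dataframe keeps seeing the unsorted, undeduplicated data.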