Sample data:
<code>import pandas as pd

data = [[1, '[email protected]'], [2, '[email protected]'], [3, '[email protected]']]
person = pd.DataFrame(data, columns=['id', 'email']).astype({'id':'int64', 'email':'object'})
</code>
Reproducible code:
<code>(person.sort_values(by = ['email', 'id'], ascending = [True, True])
.drop_duplicates(subset = 'email', keep = 'first', inplace = True))
</code>
I expected the code above to modify person in place so that it looks like this:
<code> id email
1 2 [email protected]
0 1 [email protected]
</code>
But instead, person still looks like its original form:
<code> id email
0 1 [email protected]
1 2 [email protected]
2 3 [email protected]
</code>
If I break the chain into two separate statements, it works:
<code>person1 = person.sort_values(by = ['email', 'id'], ascending = [True, True])
person1.drop_duplicates(subset = 'email', keep = 'first', inplace = True)
</code>
In this case, person1 has the desired form:
<code> id email
1 2 [email protected]
0 1 [email protected]
</code>
Why doesn't the first snippet remove the duplicated emails in place?
Your first code is more or less equivalent to:
<code>temporary = person.sort_values(by = ['email', 'id'], ascending = [True, True])
temporary.drop_duplicates(subset = 'email', keep = 'first', inplace = True)
del temporary
</code>
Thus, the temporary dataframe is modified and then discarded: sort_values returns a new DataFrame, so the in-place drop_duplicates never touches person.
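The behaviour can be checked directly. The following is a minimal sketch, not part of the original post: the email addresses are made-up placeholders (ids 1 and 3 share one), and the names sorted_copy and result are introduced only for illustration. It shows that sort_values hands back a different object and that drop_duplicates with inplace=True returns None.

```python
import pandas as pd

# Placeholder addresses (hypothetical); ids 1 and 3 share the same email.
data = [[1, "john@example.com"], [2, "bob@example.com"], [3, "john@example.com"]]
person = pd.DataFrame(data, columns=["id", "email"])

# Without inplace=True, sort_values returns a new DataFrame object.
sorted_copy = person.sort_values(by=["email", "id"], ascending=[True, True])
print(sorted_copy is person)  # False

# With inplace=True, drop_duplicates mutates sorted_copy and returns None.
result = sorted_copy.drop_duplicates(subset="email", keep="first", inplace=True)
print(result)  # None

# Only the sorted copy lost a row; the original frame is unchanged.
print(len(sorted_copy))  # 2
print(len(person))       # 3
```

In the chained version from the question, the object playing the role of sorted_copy is never bound to a name, so the deduplicated result is unreachable once the statement finishes.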
If you want to chain commands, do not use inplace=True; instead, assign the output to a variable:
<code>out = (person.sort_values(by = ['email', 'id'], ascending = [True, True])
.drop_duplicates(subset = 'email', keep = 'first')
)
</code>
Output:
<code> id email
1 2 [email protected]
0 1 [email protected]
</code>
Or use only in-place commands:
<code>person.sort_values(by = ['email', 'id'], ascending = [True, True], inplace=True)
person.drop_duplicates(subset = 'email', keep = 'first', inplace=True)
print(person)
</code>
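If the aim is for the name person itself to end up holding the deduplicated rows while keeping the chained style, one option is to rebind the name to the chained result. This is a sketch under assumptions rather than something from the original answer: the email addresses are made-up placeholders, and the reset_index(drop=True) step is an optional extra that renumbers the rows.

```python
import pandas as pd

# Placeholder addresses (hypothetical); ids 1 and 3 share the same email.
data = [[1, "john@example.com"], [2, "bob@example.com"], [3, "john@example.com"]]
person = pd.DataFrame(data, columns=["id", "email"])

# Chain without inplace and rebind the name to the final result.
person = (
    person.sort_values(by=["email", "id"], ascending=[True, True])
          .drop_duplicates(subset="email", keep="first")
          .reset_index(drop=True)  # optional: renumber the rows from 0
)
print(person)
```

Note that rebinding creates a new object rather than mutating the original one, so any other variable that still refers to the original DataFrame keeps seeing all three rows.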