Hi, I have a dataset like this:
id | name |
---|---|
1 | A |
1 | null |
2 | A |
3 | B |
4 | null |
4 | B |
5 | A |
6 | null |
I want to remove the duplicate rows (same id) and, for each id, keep the row where the name is not null. If an id only has a null name, that row should be kept.
This is the expected output:
id | name |
---|---|
1 | A |
2 | A |
3 | B |
4 | B |
5 | A |
6 | null |
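To make the rule precise, here is a minimal plain-Python sketch of the logic I am after. It is not Spark code; `rows` and `dedupe_prefer_non_null` are just illustrative names I made up for this post, and null is represented as `None`.

```python
# Plain-Python sketch of the intended rule (illustration only, not Spark).
# `rows` mirrors the sample dataset above; null is written as None.
rows = [
    (1, "A"), (1, None), (2, "A"), (3, "B"),
    (4, None), (4, "B"), (5, "A"), (6, None),
]

def dedupe_prefer_non_null(rows):
    """Keep one row per id, preferring a row whose name is not None."""
    best = {}
    for id_, name in rows:
        # Store the first row seen for an id, then replace it only when
        # the stored name is None and the new name is not.
        if id_ not in best or (best[id_] is None and name is not None):
            best[id_] = name
    return sorted(best.items())

result = dedupe_prefer_non_null(rows)
print(result)
```

I assume the Spark equivalent would be some kind of per-id aggregation or window that skips nulls, but that is the part I am unsure about.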
I tried this:
(
    df
    .orderBy("name", ascending=False)
    .dropDuplicates(["id"])
    .show(10, False)
)
It removes the duplicate rows, but for some ids I still get null in the "name" column even though a non-null row exists for that id.
Thanks in advance for your help.