My SparkSQL DataFrame looks like this:
<code>+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 123| abc| X| 3|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
</code>
<code>+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 123| abc| X| 3|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
</code>
+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 123| abc| X| 3|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
I want to remove duplicates based on the key columns a
and b
that don’t have value N
in column c
IF there exists a single row for that key combination that has value N
in column c
.
Expected output:
<code>+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
</code>
<code>+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
</code>
+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
For key combination a = 123
, b = abc
there exists more than one row where c = N
. Therefore we want to remove all rows for that key combination where c != N
.
For key combination a = 456
, b = def
there doesn’t exist any rows where c = N
. Therefore we do not want to remove any rows.