I have a PySpark dataframe like this:
| ID | source | dest | value1 |
|---|---|---|---|
| 12345 | ABC | DEF | 500 |
| 12345 | DEF | ABC | 600 |
| 67890 | JKL | MNO | 100 |
| 67890 | MNO | JKL | 200 |
I want to add a new column `value2` that holds the `value1` of the row with the same `ID` but with `source` and `dest` swapped. The dataframe above would then look like this (the second row of each `ID` pair is removed):
| ID | source | dest | value1 | value2 |
|---|---|---|---|---|
| 12345 | ABC | DEF | 500 | 600 |
| 67890 | JKL | MNO | 100 | 200 |
I thought of using a UDF, but I believe a UDF operates on one row at a time and cannot compare against other rows (say, fetch the `value1` of the row whose `ID` matches the current row's and whose `source` and `dest` are the current row's `dest` and `source`).
Then I thought of splitting the dataframe in two, each half containing one of the two rows per `ID`, and then somehow joining the two dataframes, but I'm not sure this is the best way to do it.
How can I get the dataframe I want?