I currently have a PySpark DataFrame like this:
| Key | Col1 | Col2 | Col3 | Col4 | col_changed |
|-----|------|------|------|------|-------------|
| X   | 2    | 1    | 2    | 3    | Col1        |
| X   | 3    | 1    | 2    | 6    | Col4        |
| X   | 3    | 2    | 2    | 3    | Col2        |
There are a lot of duplicate rows per `Key`. I would like to merge them into a single row, keeping the value of the column named in the `col_changed` column. The remaining columns are exact copies of one another (like `Col3` here), so taking the first value is fine.
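For reproducibility, here is a minimal snippet that builds the example DataFrame above (assuming the value columns are integers):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("X", 2, 1, 2, 3, "Col1"),
        ("X", 3, 1, 2, 6, "Col4"),
        ("X", 3, 2, 2, 3, "Col2"),
    ],
    ["Key", "Col1", "Col2", "Col3", "Col4", "col_changed"],
)
```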
Example outcome in this case:
| Key | Col1 | Col2 | Col3 | Col4 |
|-----|------|------|------|------|
| X   | 2    | 2    | 2    | 6    |
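For context, here is a sketch of one approach I was considering (only tested mentally on the toy data above): for each value column, take the value from the row whose `col_changed` names that column, and fall back to the first value when no row changed it:

```python
from pyspark.sql import functions as F

value_cols = ["Col1", "Col2", "Col3", "Col4"]

aggs = [
    F.coalesce(
        # value from the row whose col_changed names this column, if any
        F.first(F.when(F.col("col_changed") == c, F.col(c)), ignorenulls=True),
        # otherwise all duplicates agree on this column, so any first value works
        F.first(F.col(c)),
    ).alias(c)
    for c in value_cols
]

merged = df.groupBy("Key").agg(*aggs)
merged.show()
# +---+----+----+----+----+
# |Key|Col1|Col2|Col3|Col4|
# +---+----+----+----+----+
# |  X|   2|   2|   2|   6|
# +---+----+----+----+----+
```

Is this a reasonable way to do it, or is there a cleaner/more efficient pattern for this kind of merge?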