I would like to merge rows in a PySpark DataFrame based on partially overlapping values in a column. Here’s a simplified example:
| values | sources |
| -------- | -------------- |
| [a, b] | [s1, s2, s3] |
| [b, c, d]| [s5, s1] |
| [x, y] | [s7] |
In this example, rows 1 and 2 share one source, `s1`, so their values should be merged. `[x, y]` doesn’t share any sources with the other rows, so it should be left alone. The expected output would be something like:
I started by exploding the sources column into the following, but I'm not sure how to go from there. Thank you in advance!
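
To pin down the merge rule described above: rows that share any source, directly or through a chain of other rows, should end up in one group. Below is a minimal plain-Python sketch of that logic using union-find. It is not a PySpark solution, only an illustration of the intended grouping on a small list. The function name `merge_by_shared_sources` is hypothetical, and merging the sources together with the values is an assumption.

```python
def merge_by_shared_sources(rows):
    """rows: list of (values, sources) pairs.

    Returns one (values, sources) pair per group of rows linked by
    shared sources. Assumption: sources are unioned along with values.
    """
    parent = list(range(len(rows)))  # union-find parent pointer per row

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_row_with = {}  # source -> index of the first row containing it
    for i, (_, sources) in enumerate(rows):
        for s in sources:
            if s in first_row_with:
                # Link this row's group to the group that already has s.
                ra, rb = find(i), find(first_row_with[s])
                parent[max(ra, rb)] = min(ra, rb)
            else:
                first_row_with[s] = i

    merged = {}  # root row index -> (set of values, set of sources)
    for i, (values, sources) in enumerate(rows):
        vals, srcs = merged.setdefault(find(i), (set(), set()))
        vals.update(values)
        srcs.update(sources)
    return [(sorted(v), sorted(s)) for v, s in merged.values()]


rows = [
    (["a", "b"], ["s1", "s2", "s3"]),
    (["b", "c", "d"], ["s5", "s1"]),
    (["x", "y"], ["s7"]),
]
result = merge_by_shared_sources(rows)
# result == [(['a', 'b', 'c', 'd'], ['s1', 's2', 's3', 's5']),
#            (['x', 'y'], ['s7'])]
```

This is the connected-components problem on a graph whose edges are shared sources, so a distributed version would need some connected-components step rather than a single `groupBy`.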