I am using PySpark and have a dataframe that has two columns, a and b; to start, each row holds a single value in each column. There can (but won't always) be duplicate values in b across rows. The objective is to transform the dataframe so that all of the overlapping values of b are collected into an array in a single row, along with an array of the corresponding values of a, removing the old rows in the process. Rows where b does not overlap with another row should remain unaltered.
If this was my data:
data = [
    ('00003-01', 4249300705),
    ('00003-01', 4249331199),
    ('00004-10', 4242100870),
    ('00004-10', 4242180791),
    ('00005-01', 4249301111),
    ('00005-01', 4249332222),
    ('00006-10', 4242103333),
    ('00006-10', 4242184444)
]
columns = ["a", "b"]
df = spark.createDataFrame(data, schema=columns)
After running the method, df.show(truncate=False) would be:
+--------------------+------------------------------------------------+
|a                   |b                                               |
+--------------------+------------------------------------------------+
|[00003-01, 00004-10]|[4249300705, 4249331199, 4242180791, 4242100870]|
|[00005-01, 00006-10]|[4249332222, 4249301111, 4242184444, 4242103333]|
+--------------------+------------------------------------------------+
I've tried a few ways using joins, but this approach seems to break down whenever more than two rows need to be meshed together into one (I'm not ruling joins out; I've just had no luck). I think it could be accomplished with some nested loops, and I'm not totally against that if it's the only way, but I feel like there must be something more clever and efficient to solve this.
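For reference, this is roughly the shape of the join-based attempt (a minimal sketch, not exactly what I ran; the aliases and the a_group/b_group names are just for illustration). It self-joins on b so that rows sharing a value get paired, then collects the matched values per a, but it only merges direct overlaps, so chains spanning more than two rows never collapse into a single row:

from pyspark.sql import functions as F

# Pair up rows that share a value in b via a self-join, then collect the
# matched a and b values for each row. This only captures direct overlaps:
# if row1 overlaps row2 and row2 overlaps row3, row1 and row3 still land in
# separate groups, which is where it falls apart for me.
merged = (
    df.alias("x")
    .join(df.alias("y"), F.col("x.b") == F.col("y.b"))
    .groupBy("x.a")
    .agg(
        F.collect_set("y.a").alias("a_group"),
        F.collect_set("y.b").alias("b_group"),
    )
)
merged.show(truncate=False)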