I have a table that looks like this:
src | dst | type |
---|---|---|
2 | 1 | A |
2 | 1 | B |
3 | 2 | T |
4 | 3 | B |
4 | 3 | T |
I would like to keep one row per src
& dst
pair. That is:
src | dst |
---|---|
2 | 1 |
3 | 2 |
4 | 3 |
This table has billions of rows and I’m looking for the most efficient way to achieve this.
Option 1 – dropDuplicates
df = df.dropDuplicates(subset=['src', 'dst'])
Option 2 – groupBy + First
df = df.groupBy('src', 'dst').select(
F.first('src').alias('src'),
F.first('dst').alias('dst')
)
I suppose I could also use row_number
to partition by src
and dst
, randomly sort the partitions and keep their first record.
What’s the most efficient way to do this?