I have a PySpark DataFrame:
deviceId  timestamp
009eeb    2024-04-22
009eeb    2024-04-24
7c002v    2024-04-20
7c002v    null
4fd556    null
4fd556    null
For each deviceId, I need to keep only the row with the max timestamp and drop the duplicates; when a device has no timestamps at all, a single null row should remain.
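For a reproducible example, the input can be built like this (I'm assuming timestamp is a string column here; the actual column may be a date/timestamp type):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; timestamp is assumed to be a
# string column, but the real schema may differ.
df = spark.createDataFrame(
    [('009eeb', '2024-04-22'),
     ('009eeb', '2024-04-24'),
     ('7c002v', '2024-04-20'),
     ('7c002v', None),
     ('4fd556', None),
     ('4fd556', None)],
    ['deviceId', 'timestamp'])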
I tried the below code:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compute the per-device max timestamp, then keep matching or null rows
w = Window.partitionBy('deviceId')
df_max = (df.withColumn('max_col', F.max('timestamp').over(w))
            .where((F.col('timestamp') == F.col('max_col')) | (F.col('timestamp').isNull()))
            .dropDuplicates())
But this code gives me the following output, which is not what I need:
deviceId  timestamp
009eeb    2024-04-24
7c002v    null
4fd556    null
4fd556    null
The result I need is as below:
deviceId  timestamp
009eeb    2024-04-24
7c002v    2024-04-20
4fd556    null
Please suggest what changes need to be made.