I saw a similar post on Stack Overflow and am looking for the same solution.
I think PySpark's LAG cannot read the derived value of a column from the previous record. Please share if you find a solution.
Column A   Column B   Column C   Column D
========   ========   ========   ========
1234       202401     123        A
1234       202402     345        null
1234       202403     50         null
I need to apply LAG to Column C and Column D:
if LAG(Column C) > 75 and LAG(Column D) == 'A', then the Column D value is 'A'; otherwise it is 'N'.
Example:
df = df.withColumn("Column B", when((col("Column D").isNull()) & (lag("Column D").over(windowSpec) == "A") & (col("Column C") > 75), "N").otherwise(df["Column D"]))
Problem: only the second row is updated. The third row does not take into account the Column D value that was just derived for the second row.
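A window function such as lag reads the values already stored in the input column, so a value derived for one row in the same withColumn call is not visible to the next row. The rule above is sequential: each row depends on the result of the row before it. Below is a minimal sketch of that sequential logic in plain Python for a single Column A group. The function name fill_column_d, the tuple layout, and the threshold parameter are hypothetical names chosen for illustration. It also assumes that rows are ordered by Column B, that existing non-null Column D values are kept, and that a null first row becomes 'N'.

```python
from typing import List, Optional, Tuple

# One record: (Column A, Column B, Column C, Column D)
Row = Tuple[int, int, int, Optional[str]]


def fill_column_d(rows: List[Row], threshold: int = 75) -> List[Row]:
    """Fill null Column D values for ONE Column A group.

    Assumptions (not stated in the question): rows are processed in
    Column B order, and non-null Column D values are left unchanged.
    """
    out: List[Row] = []
    prev_c: Optional[int] = None  # Column C of the previous row
    prev_d: Optional[str] = None  # Column D of the previous row, AFTER derivation
    for a, b, c, d in sorted(rows, key=lambda r: r[1]):
        if d is None:
            # The rule from the question, applied to the derived previous value.
            if prev_c is not None and prev_c > threshold and prev_d == "A":
                d = "A"
            else:
                d = "N"
        out.append((a, b, c, d))
        prev_c, prev_d = c, d  # carry the derived value forward
    return out


rows = [
    (1234, 202401, 123, "A"),
    (1234, 202402, 345, None),
    (1234, 202403, 50, None),
]
result = fill_column_d(rows)
```

With the sample data, the second row becomes 'A' (previous Column C is 123 and previous Column D is 'A'), and the third row also becomes 'A' because it sees the derived 'A' from the second row together with a previous Column C of 345. In PySpark, one possible way to run this kind of per-group loop is groupBy("Column A").applyInPandas(...), with the loop adapted to a pandas DataFrame; that wiring is not shown here and would need testing on real data.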