I have the following dataframe:
df = spark.createDataFrame(
[
("D1", "D2", "H1", None, None),
("D1", "D2", "H1", "H2", None),
("D1", "D2", "H1", "H2", "H3")
],
["Dimension1", "Dimention2", "Hierarchy1", "Hierarchy2", "Hierarchy3"]
)
I want to transform it so that it becomes something like this instead:
new_df = spark.createDataFrame(
[
("D1", "D2", "H1"),
("D1", "D2", "H2"),
("D1", "D2", "H3")
],
["Dimension1", "Dimention2", "Hierarchy"]
)
The logic being:
whencondition = when((col("Hierarchy1").isNotNull()) & (col("Hierarchy2").isNull()) & (col("Hierarchy3").isNull()), lit("H1")).when((col("Hierarchy1").isNotNull()) & (col("Hierarchy2").isNotNull()) & (col("Hierarchy3").isNull()), lit("H2")).when((col("Hierarchy1").isNotNull()) & (col("Hierarchy2").isNotNull()) & (col("Hierarchy3").isNotNull()), lit("H3")).alias("Hierarchy")
display(df.select("Dimension1", "Dimention2", whencondition))
There can be of course be any number of hierarchy columns, but in the end output I only want there to be one column to show what level of hierarchy that record is at. I started off by creating a list
hierarchies = ["Hierarchy1", "Hierarchy2", "Hierarchy3"]
and got as far as this:
when(reduce(lambda x, y: x & y, [(col( "`" + x + "`").isNotNull()) if x in hierarchies[:i+1] else (col( "`" + x + "`").isNull()) for x in hierarchies]), lit(hierarchies[i]))
which works for i < len(hierarchies)
, but not any further unfortunately