I have a DataFrame representing a genealogy tree, with the following columns: (“Generation”, “Child_name”, “child_hair_color”, “Parent_name”, “parent_hair_color”, “parent_eye_color”).
The oldest generation has the value “0” in the Generation column; the youngest, I assume, has the maximum value.
I’d like to propagate the information about black hair color from daughter to mother, from mother to grandmother, and so on. It is important to do this step by step, one generation at a time, rather than in a single pass over the whole DataFrame.
The only condition is that the transfer should stop when parent_eye_color is “Hazel”.
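To make the intended rule concrete, here is a minimal plain-Python sketch on a made-up four-row example. It is only one reading of the rule, under two assumptions: that “Generation” on a row is the child's generation, and that a parent with “Hazel” eyes does not receive the color, so the chain stops there. The function name `propagate_black_hair` and the sample names are hypothetical.

```python
def propagate_black_hair(rows):
    """Push "Black" from child to parent, one generation per pass."""
    rows = [dict(r) for r in rows]  # work on copies, leave the input untouched
    max_generation = max(r["Generation"] for r in rows)

    # Walk from the youngest generation up to generation 1.
    for level in range(max_generation, 0, -1):
        newly_black = set()
        # Step 1: child -> parent, unless the parent's eyes are Hazel (assumed rule).
        for r in rows:
            if (r["Generation"] == level
                    and r["child_hair_color"] == "Black"
                    and r["parent_eye_color"] != "Hazel"):
                r["parent_hair_color"] = "Black"
                newly_black.add(r["Parent_name"])
        # Step 2: those parents appear as children in other rows; update them there.
        for r in rows:
            if r["Child_name"] in newly_black:
                r["child_hair_color"] = "Black"
    return rows


# Made-up sample data (not from my real dataset).
rows = [
    {"Generation": 2, "Child_name": "Cara", "child_hair_color": "Black",
     "Parent_name": "Beth", "parent_hair_color": "Brown", "parent_eye_color": "Blue"},
    {"Generation": 1, "Child_name": "Beth", "child_hair_color": "Brown",
     "Parent_name": "Anna", "parent_hair_color": "Blonde", "parent_eye_color": "Green"},
    {"Generation": 2, "Child_name": "Dina", "child_hair_color": "Black",
     "Parent_name": "Eve", "parent_hair_color": "Red", "parent_eye_color": "Hazel"},
    {"Generation": 1, "Child_name": "Eve", "child_hair_color": "Red",
     "Parent_name": "Fay", "parent_hair_color": "Grey", "parent_eye_color": "Blue"},
]

result = propagate_black_hair(rows)
```

In this sample, “Black” travels from Cara to Beth and then to Anna, while Eve has “Hazel” eyes, so neither Eve nor Fay is changed.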
My idea was to use the code below, but after running it I don’t see any difference between the input and output DataFrames. It is important to me that the code works efficiently on large datasets (30-700k rows).
from pyspark.sql import functions as F
from pyspark.sql.functions import when

selected_column = "Generation"
# Use agg() to find the min and max of the "Generation" column
min_max_values = df.agg(F.min(selected_column).alias("min_value"), F.max(selected_column).alias("max_value")).first()
max_value = min_max_values["max_value"]
# Loop through generations, from the youngest down to generation 1
for current_level in range(max_value, 0, -1):
    # Copy "Black" from child to parent unless the parent's eyes are "Hazel"
    df = df.withColumn("parent_hair_color",
                       when((df.child_hair_color == "Black") & (df.Generation == current_level) & (df.parent_eye_color != "Hazel"),
                            "Black")
                       .otherwise(F.col("parent_hair_color")))
    # Names of all parents that currently have black hair
    df1 = df.filter(df.parent_hair_color == "Black").select(F.col("Parent_name").alias("Parent"))
    # Find the rows where such a parent appears as a child
    df = df.join(df1, df.Child_name == df1.Parent, "left").select(df["*"], df1["Parent"])
    df = df.withColumn("child_hair_color",
                       when(df.Child_name == df.Parent,
                            "Black")
                       .otherwise(F.col("child_hair_color")))
    df = df.drop("Parent")