I have a sample dataset and code snippet in PySpark where I’m attempting to generate a hash value for each row in a DataFrame. Here’s the code I’m using:
from datetime import date

from pyspark.sql.functions import concat, concat_ws, hash, sha2

df = spark.createDataFrame([
    ("user1", "Bangalore", "Grade1", date(2000, 8, 1)),
    ("user2", "Delhi", "Grade2", date(2000, 6, 2)),
    ("user3", "Delhi", "Grade2", date(2000, 6, 2)),
    ("user4", "Chennai", "Grade3", date(2000, 5, 3)),
    # user5: two rows whose only difference is which column holds the null
    ("user5", None, "Grade3", date(2000, 5, 3)),
    ("user5", "Grade3", None, date(2000, 5, 3)),
    # user6: same idea, but with empty strings instead of nulls
    ("user6", "", "Grade4", date(2000, 5, 3)),
    ("user6", "Grade4", "", date(2000, 5, 3)),
], schema="userId string, city string, grade string, graduationDate date")

cols = ["userId", "city", "grade", "graduationDate"]

df_hash = (
    df
    # SHA-384 over a "-"-delimited concatenation of the columns
    .withColumn("_delta_hash_sha2_concat_ws", sha2(concat_ws("-", *cols), 384))
    # SHA-256 over a plain concatenation of the columns
    .withColumn("_delta_hash_sha2_concat", sha2(concat(*cols), 256))
    # Spark's built-in 32-bit Murmur3 hash of the columns
    .withColumn("_delta_hash_hash", hash(*cols))
)
display(df_hash)
In my dataset, some rows contain null values, and I prefer to retain these nulls rather than replacing them with empty strings or any other placeholder.
However, with the sha2/concat_ws approach, rows containing nulls can produce identical hashes. For instance, the two "user5" rows generate the same hash even though the null sits in city in one row and in grade in the other.
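To make the collision concrete, here is how I inspected the intermediate string that sha2 actually receives (cols is the column list defined above):

# concat_ws silently skips nulls, so both user5 rows collapse to
# the identical string "user5-Grade3-2000-05-03"
df.filter("userId = 'user5'").select(
    concat_ws("-", *cols).alias("pre_hash_string")
).show(truncate=False)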
Upon further inspection, the problem appears to lie in the concatenation step rather than in sha2 itself: concat_ws skips null values entirely, so any information about where the null occurred is lost before hashing. (concat behaves differently but no better, returning null as soon as any input is null, and the built-in hash function appears to skip nulls as well.) I'm therefore looking for a way to generate a distinct hash for each row while preserving the nulls, and I'd like to avoid using row_number to assign row IDs.
Important Note: I do not wish to replace null values with empty strings or “NULL”. I aim to keep null values intact.
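For what it's worth, one idea I've been considering (just a rough sketch, and I'm not certain it's the right approach) is to hash a JSON rendering of the row instead of a delimited string. Because to_json emits the field names, the two user5 rows serialize to different strings even when the null field itself is omitted from the JSON, and the null values in the DataFrame are left untouched:

from pyspark.sql.functions import sha2, struct, to_json

# Hash the JSON form of each row: the field names record which
# column held the null, and no column values are modified
df_json_hash = df.withColumn(
    "_delta_hash_json", sha2(to_json(struct(*cols)), 256)
)

But I don't know whether this is reliable or idiomatic, so alternatives are welcome.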
I appreciate any insights or alternative approaches to achieve this goal. Thank you!