I would like to split a dataset of ~600,000 rows into 6 smaller packages of ~100,000 rows each.
I use the code below, and during the last iteration of my while loop everything freezes without any error; only a kernel restart helps.
I run Spark on a cluster on my own infrastructure.
import time
from datetime import datetime

# Read lookup data
t0 = time.time()
data_lookup = spark.read.format("delta").load("hdfs://xxxxx:9000/hd_micula/delta_jsons/data_lookup")
t1 = time.time()
total = t1 - t0
print(total)
# Total row count: 661691
# Define the number of splits
n_splits = 5

# Calculate the row count of each chunk
each_len = data_lookup.count() // n_splits
print(datetime.now(), 'Each length:', each_len)

# Create a copy of the original dataframe
copy_df = data_lookup

# Iterate over each chunk; the last iteration takes whatever rows remain
i = 0
while i <= n_splits:
    if i == n_splits:
        temp_df = copy_df
    else:
        temp_df = copy_df.limit(each_len)
        copy_df = copy_df.subtract(temp_df)
    temp_df.show(1)
    print(datetime.now(), i, temp_df.count())
    del temp_df
    spark.catalog.clearCache()
    i += 1
2024-05-07 23:22:38.088099 Each length: 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|E57718A9D71676B0E...|5e7169d546e0fb000...|40FF470EE166F97BF...|142c9db1-b551-4a9...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:22:42.071664 0 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|05D5758CE42605F3C...|61bcefbf2467f07ee...|9CA6911A9F9849D7F...|330d331e-1556-43b...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:22:57.091143 1 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|00008C3484D67F84E...|61cac2f52467f0348...|A383E09AA61EF213E...|695d111d-8d81-4ff...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:23:24.483418 2 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|61A1A2849005D90DB...|5ebff1f0ad49b31cc...|41542A120B2CFA934...|a7777afe-6878-427...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:24:05.248907 3 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|BA5FD30E5ADB67B24...|5e718be746e0fb000...|0D87F8C08827059C0...|bc957e74-4a80-456...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:24:59.117195 4 132338
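My suspicion is that each subtract() grows the query lineage, so by the last iteration temp_df = copy_df drags the whole chain of five subtracts behind it and the final count() re-evaluates everything at once. A minimal sketch of the lineage-truncating variant I have in mind (the checkpoint directory is an assumption, not my real path):

# Hypothetical checkpoint dir; any HDFS path the executors can write to
spark.sparkContext.setCheckpointDir("hdfs://xxxxx:9000/tmp/checkpoints")

i = 0
copy_df = data_lookup
while i <= n_splits:
    if i == n_splits:
        temp_df = copy_df
    else:
        temp_df = copy_df.limit(each_len)
        # checkpoint() materializes the subtract eagerly and truncates the
        # lineage, so the next iteration starts from a fresh plan
        copy_df = copy_df.subtract(temp_df).checkpoint()
    print(datetime.now(), i, temp_df.count())
    i += 1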
And the Spark configuration:
import pyspark
from delta import configure_spark_with_delta_pip

builder = (pyspark.sql.SparkSession.builder.appName("GoldeenRecrodJsonsGenerator")
           .config('spark.driver.memory', "64G")
           .config('spark.driver.maxResultSize', "16G")
           .config('spark.master', "spark://xxxx:7077")
           .config("spark.executor.cores", "16")
           .config("spark.executor.memory", "70G")
           .config("spark.cores.max", "304")
           .config("spark.task.cpus", "15")
           .config("spark.e", "1200G")
           .config("spark.executor.memoryOverhead", "16G")
           .config("spark.task.maxFailures", "1")
           .config('spark.submit.deployMode', 'client')
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()
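To verify that these settings actually reach the live session, the effective values can be read back after getOrCreate() via the standard spark.conf API (a small sanity-check sketch; which keys to inspect is my choice):

# Sanity check: print the values the running session actually uses
for key in ("spark.executor.memory", "spark.task.cpus",
            "spark.cores.max", "spark.executor.memoryOverhead"):
    print(key, "=", spark.conf.get(key, "unset"))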
What is wrong with my code or environment?
- Checked the configuration of my Spark cluster
- Tried different values of n_splits (5, 10, 20)
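For reference, a one-pass alternative for this kind of split would be randomSplit, which avoids the subtract chain entirely. A minimal sketch (the equal weights and the seed are my assumptions, and the chunk sizes come out only approximately equal):

# One-pass alternative: six roughly equal random chunks, no subtract chain
parts = data_lookup.randomSplit([1.0] * 6, seed=42)
for i, part in enumerate(parts):
    print(datetime.now(), i, part.count())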