I would like to split a dataset of ~600,000 rows into 6 smaller packages of ~100,000 rows each.
I use the code below, and during the last iteration of my while loop everything freezes without any error; only a kernel restart helps.
I run Spark on a cluster on my own infrastructure.
import time
from datetime import datetime

# Read lookup data
t0 = time.time()
data_lookup = spark.read.format("delta").load("hdfs://xxxxx:9000/hd_micula/delta_jsons/data_lookup")
t1 = time.time()
total = t1 - t0
print(total)
# Total row count: 661691
# Define the number of splits
n_splits = 5

# Calculate the row count of each chunk
each_len = data_lookup.count() // n_splits
print(datetime.now(), 'Each length:', each_len)

# Create a copy of the original dataframe
copy_df = data_lookup

# Iterate over each chunk; the last iteration takes whatever rows remain
i = 0
while i <= n_splits:
    if i == n_splits:
        temp_df = copy_df
    else:
        temp_df = copy_df.limit(each_len)
        copy_df = copy_df.subtract(temp_df)
    temp_df.show(1)
    print(datetime.now(), i, temp_df.count())
    del temp_df
    spark.catalog.clearCache()
    i += 1
2024-05-07 23:22:38.088099 Each length: 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|E57718A9D71676B0E...|5e7169d546e0fb000...|40FF470EE166F97BF...|142c9db1-b551-4a9...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:22:42.071664 0 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|05D5758CE42605F3C...|61bcefbf2467f07ee...|9CA6911A9F9849D7F...|330d331e-1556-43b...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:22:57.091143 1 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|00008C3484D67F84E...|61cac2f52467f0348...|A383E09AA61EF213E...|695d111d-8d81-4ff...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:23:24.483418 2 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|61A1A2849005D90DB...|5ebff1f0ad49b31cc...|41542A120B2CFA934...|a7777afe-6878-427...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:24:05.248907 3 132338
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
| UUID| OBJECT_ID| PERSON_HSK| NEW_UUID|JSON|ERROR|PUBLICATION_ID|GLOBAL_PUBLICATION_ID|
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
|BA5FD30E5ADB67B24...|5e718be746e0fb000...|0D87F8C08827059C0...|bc957e74-4a80-456...| | | | |
+--------------------+--------------------+--------------------+--------------------+----+-----+--------------+---------------------+
only showing top 1 row
2024-05-07 23:24:59.117195 4 132338
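My suspicion is that each subtract() grows the query lineage, so by the last iteration temp_df = copy_df drags the whole chain of five subtracts behind it and the final count() re-evaluates everything at once. A minimal sketch of the lineage-truncating variant I have in mind (the checkpoint directory is an assumption, not my real path):

# Hypothetical checkpoint dir; any HDFS path the executors can write to
spark.sparkContext.setCheckpointDir("hdfs://xxxxx:9000/tmp/checkpoints")

i = 0
copy_df = data_lookup
while i <= n_splits:
    if i == n_splits:
        temp_df = copy_df
    else:
        temp_df = copy_df.limit(each_len)
        # checkpoint() materializes the subtract eagerly and truncates the
        # lineage, so the next iteration starts from a fresh plan
        copy_df = copy_df.subtract(temp_df).checkpoint()
    print(datetime.now(), i, temp_df.count())
    i += 1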
And the Spark configuration:
import pyspark
from delta import configure_spark_with_delta_pip

builder = (pyspark.sql.SparkSession.builder.appName("GoldeenRecrodJsonsGenerator")
           .config('spark.driver.memory', "64G")
           .config('spark.driver.maxResultSize', "16G")
           .config('spark.master', "spark://xxxx:7077")
           .config("spark.executor.cores", "16")
           .config("spark.executor.memory", "70G")
           .config("spark.cores.max", "304")
           .config("spark.task.cpus", "15")
           .config("spark.e", "1200G")
           .config("spark.executor.memoryOverhead", "16G")
           .config("spark.task.maxFailures", "1")
           .config('spark.submit.deployMode', 'client')
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()
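To verify that these settings actually reach the live session, the effective values can be read back after getOrCreate() via the standard spark.conf API (a small sanity-check sketch; which keys to inspect is my choice):

# Sanity check: print the values the running session actually uses
for key in ("spark.executor.memory", "spark.task.cpus",
            "spark.cores.max", "spark.executor.memoryOverhead"):
    print(key, "=", spark.conf.get(key, "unset"))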
What is wrong with my code or environment?
- Checked the configuration of my Spark cluster
- Tried different values of n_splits (5, 10, 20)
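For reference, a one-pass alternative for this kind of split would be randomSplit, which avoids the subtract chain entirely. A minimal sketch (the equal weights and the seed are my assumptions, and the chunk sizes come out only approximately equal):

# One-pass alternative: six roughly equal random chunks, no subtract chain
parts = data_lookup.randomSplit([1.0] * 6, seed=42)
for i, part in enumerate(parts):
    print(datetime.now(), i, part.count())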