I’m encountering a memory issue in my PySpark application on Databricks: the memory usage on the driver node keeps increasing over time until the application eventually crashes with an out-of-memory (OOM) error. The problem occurs when I execute the following function in a loop for each image in a folder:
```python
def read_image_and_save_to_df(image_path, save_path):
    # Read image (tif)
    # Convert it to a PySpark DataFrame
    # Save to Parquet format
    ...

# For each image in the folder:
for image_path in image_folder:
    read_image_and_save_to_df(image_path, save_path)
```
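For context, the function body is roughly the following (a simplified sketch; the Pillow/NumPy reading and the `(row, col, value)` layout are stand-ins for my actual code):

```python
import numpy as np
from PIL import Image
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_image_and_save_to_df(image_path, save_path):
    # Read the TIFF into a NumPy array on the driver
    pixels = np.asarray(Image.open(image_path), dtype=np.float32)

    # Flatten the image to (row, col, value) tuples and build a DataFrame from them
    rows, cols = pixels.shape[0], pixels.shape[1]
    records = [(r, c, float(pixels[r, c].mean()))
               for r in range(rows) for c in range(cols)]
    df = spark.createDataFrame(records, schema=["row", "col", "value"])

    # Write the DataFrame out as Parquet
    df.write.mode("overwrite").parquet(save_path)
```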
I’ve tried cleaning up objects at the end of the function, clearing the cache, and unpersisting RDDs via the SparkContext, but the memory usage still grows; roughly what I do at the end of each call is shown in the sketch below. It looks like a memory leak, but I can’t tell why it’s happening. Any ideas on how to troubleshoot this issue?
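The cleanup I mentioned looks roughly like this (simplified; `spark` and `df` are the session and the per-image DataFrame from the sketch above):

```python
import gc

def cleanup(spark, df):
    # What I run at the end of each call (simplified)
    df.unpersist(blocking=True)   # no-op unless the DataFrame was cached
    spark.catalog.clearCache()    # drop every cached table/DataFrame
    gc.collect()                  # ask Python to release leftover objects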
Memory usage before clearing the cache at the end of the function:

Memory usage after clearing the cache at the end of the function: