I have the following scenario coded in a PySpark job (a simplified sketch of the code is included below the outline):
df = read from a file in S3 (not very big).
Some joins on df.
Collect df into an array of [x, y, z] rows (not a very big array).
for i in range(10):
In each iteration I batch the rows from this array into groups and send them to threads for further processing in parallel.
Inside each thread I do the following:
Use the x, y, z values as well as some other dataframes stored globally in the Python file, e.g. return df2.filter(col("some_col").between(x, y)).collect().
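To make it concrete, here is a simplified sketch of the structure. The paths, column names, batch size, and the process_batch helper are placeholders, not my real code:

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# A dataframe stored globally in the Python file; path and schema are placeholders
df2 = spark.read.parquet("s3://bucket/reference-data/")


def process_batch(batch):
    # Runs inside a thread: use the collected x, y, z values to filter
    # the globally stored dataframe and collect the result to the driver
    results = []
    for x, y, z in batch:
        results.append(df2.filter(F.col("some_col").between(x, y)).collect())
    return results


# Not very big input file; path and joins are placeholders
df = spark.read.parquet("s3://bucket/input-data/")
# ... some joins on df ...

rows = [(r["x"], r["y"], r["z"]) for r in df.collect()]  # not a very big array

for i in range(10):
    # Batch the collected rows and hand each batch to a thread
    batches = [rows[j:j + 100] for j in range(0, len(rows), 100)]  # placeholder batch size
    with ThreadPoolExecutor(max_workers=8) as pool:
        all_results = list(pool.map(process_batch, batches))
    # ... further processing of all_results ...
```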
The problem I am seeing is that after the first iteration (i=1) finishes, memory keeps piling up on the cluster, and after 2-3 more iterations the job fails. Is there any way to stop this memory leak?