I am trying to process a large data set of 167373 exercise logs. For context, I am combining all the coordinate data in each exercise log into a single vector and then calculating pairwise cosine similarity to find the most similar logs, since highly similar vectors may indicate common running or biking routes. I need to write my code with Dask so it runs in parallel, and so far I think my code is breaking when I try to calculate the (log, cosine similarity value) pairs:
def calculate_pairwise_cosine_similarities(group, threshold=0.95):
    # calculate pairwise cosine similarities within one sport group and keep
    # only the pairs whose score is above the threshold (default 0.95)
    sport, records = group
    vectors = [record[2] for record in records]
    log_ids = [record[1] for record in records]
    # all index pairs (i, j) with i < j
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    similarities = list(filter(
        lambda x: x[2] > threshold,
        map(lambda pair: (log_ids[pair[0]],
                          log_ids[pair[1]],
                          cosine_similarity(vectors[pair[0]], vectors[pair[1]])),
            pairs)))
    return sport, similarities
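Here cosine_similarity is just a scalar helper that takes two 1-D vectors and returns a single score (it is not sklearn's matrix version); a minimal, simplified sketch of what it does is:

import numpy as np

def cosine_similarity(a, b):
    # simplified sketch of the scalar cosine-similarity helper:
    # returns a single score for two 1-D coordinate vectors
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))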
When I run my code I get this error:
“BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.”
Does anyone know how I could rewrite my function to run in parallel using Dask operations, or what this error means?
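For reference, this is roughly how I am handing the grouped records to the function with dask.bag at the moment (simplified, with placeholder data instead of my real records):

import dask.bag as db
import numpy as np

# toy records just to show the shape: (sport, log_id, coordinate_vector)
records = [
    ("run",  1, np.array([1.0, 2.0, 3.0])),
    ("run",  2, np.array([1.1, 2.1, 3.1])),
    ("bike", 3, np.array([5.0, 0.0, 1.0])),
]

bag = db.from_sequence(records, npartitions=2)

# group by sport, then compute the pairwise similarities within each group
results = (
    bag.groupby(lambda r: r[0])
       .map(calculate_pairwise_cosine_similarities)
       .compute()
)
print(results)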