I’m working on a sports performance analysis project where I need to compare the rows of a DataFrame on multiple criteria (age, performance, and date). For each row, I want to count how many other rows it compares successfully against, where "successful" means the current row’s value is less than or equal to the other row’s value on every criterion.
Context:
- The DataFrame has tens of millions of rows.
- I need to compare each row with others on multiple columns: age, performance, and date.
- For each row, I want to count the number of successful comparisons.
Current Approach:
I am currently using Pandas and NumPy. A single criterion was manageable, but once multiple criteria (age, performance, and date) are combined, the algorithm becomes too slow and doesn’t seem well suited to multi-dimensional comparisons.
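For a single criterion, no pairwise matrix is needed at all: sorting plus a rank lookup gives every count in O(n log n). A minimal sketch of that idea (the function name is just illustrative):

import numpy as np

def count_leq_single(values):
    # For each v_i, count j != i with v_i <= v_j, without building
    # an n x n matrix. Ties count as successes, matching <=.
    values = np.asarray(values)
    sorted_vals = np.sort(values)
    # Number of values strictly smaller than each v_i
    below = np.searchsorted(sorted_vals, values, side='left')
    # Values >= v_i, excluding the row itself
    return (len(values) - below) - 1

The difficulty is that this trick doesn’t compose across three columns at once.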
Here’s a simplified example of my current code:
import pandas as pd
import numpy as np
# Generate random data
def generate_dynamic_dataframe(n):
    age = np.round(np.random.uniform(20, 50, n), 3)
    performance = np.round(np.random.uniform(5, 25, n), 2)
    # Random timestamps between 2000-01-01 and 2022-12-31
    dates = pd.to_datetime(np.random.randint(946684800, 1672531199, n), unit='s')
    df = pd.DataFrame({'age': age, 'performance': performance, 'date': dates})
    return df
# Example comparison with NumPy vectorization (too slow for multiple criteria)
def compare_rows_with_vectorization(df):
    # Convert columns to NumPy arrays
    age_array = df['age'].to_numpy()
    performance_array = df['performance'].to_numpy()
    date_array = df['date'].astype('int64').to_numpy()  # dates as int64 (ns since epoch)
    # Create pairwise n x n comparison matrices (the O(n^2) bottleneck)
    age_matrix = age_array[:, np.newaxis] <= age_array
    performance_matrix = performance_array[:, np.newaxis] <= performance_array
    date_matrix = date_array[:, np.newaxis] <= date_array
    # Combine comparisons: all three criteria must hold simultaneously
    valid_comparisons = age_matrix & performance_matrix & date_matrix
    # Count successful comparisons per row
    comparison_counts = valid_comparisons.sum(axis=1) - 1  # Exclude the row itself
    df['comparison_count'] = comparison_counts
    return df
# Generate data and apply the algorithm
df = generate_dynamic_dataframe(10000)
result_df = compare_rows_with_vectorization(df)
Problem:
This works, but it becomes far too slow on DataFrames with millions of rows. For 10,000 rows it already takes about 5 seconds, and the cost grows quadratically: each pairwise comparison matrix is n × n, so at 10,000 rows each boolean matrix already holds 10^8 entries. I’m looking for a more performant solution to this multi-criteria (age, performance, and date) comparison task.
Questions:
- Is there a more efficient method to perform multi-column comparisons (e.g., age, performance, and date) on a DataFrame with millions of rows?
- Is it possible to use a Fenwick tree or another efficient data structure for this multi-dimensional scenario? (See the sketch after this list for what I mean.)
- Are there better approaches using Pandas, NumPy, Dask, or even PySpark that could improve the performance?
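To make the Fenwick tree question concrete, here is a rough sketch of what I have in mind for just two of the three criteria: sweep the rows in descending date order and keep a Fenwick tree over rank-compressed performance, which answers each count in O(log n). All the names here are mine, and this is only a sketch under that two-criteria assumption, not something I claim works for the full problem; dates must be passed as int64 timestamps.

import numpy as np

class FenwickTree:
    # Binary indexed tree for prefix-sum counting (1-based indices).
    def __init__(self, size):
        self.size = size
        self.tree = np.zeros(size + 1, dtype=np.int64)

    def update(self, i, delta=1):
        while i <= self.size:
            self.tree[i] += delta
            i += i & (-i)

    def query(self, i):
        # Sum of counts at indices 1..i
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

def count_dominators_2d(dates, perfs):
    # For each row i, count rows j != i with date_j >= date_i and
    # perf_j >= perf_i, in O(n log n). Two criteria only.
    dates = np.asarray(dates)
    perfs = np.asarray(perfs)
    n = len(dates)
    # Rank-compress performance to 1..m so it can index the tree
    uniq = np.unique(perfs)
    rank = np.searchsorted(uniq, perfs) + 1
    bit = FenwickTree(len(uniq))
    order = np.argsort(dates)[::-1]  # descending date
    counts = np.empty(n, dtype=np.int64)
    i = 0
    while i < n:
        # Insert the whole group of equal dates first, so date ties
        # count as "greater or equal"...
        j = i
        while j < n and dates[order[j]] == dates[order[i]]:
            bit.update(rank[order[j]])
            j += 1
        # ...then answer the queries for that group.
        for k in range(i, j):
            r = rank[order[k]]
            # Inserted rows with perf rank >= r, minus the row itself
            counts[order[k]] = bit.query(len(uniq)) - bit.query(r - 1) - 1
        i = j
    return counts

# Example usage (two criteria only):
# counts = count_dominators_2d(df['date'].astype('int64').to_numpy(),
#                              df['performance'].to_numpy())

What I can’t see is how to bring the third criterion (age) into this scheme without falling back to something quadratic, hence my question.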
Ultimately I need to apply this algorithm to several million rows, and the goal is to make the processing as fast as possible (a few seconds at most).
Thanks in advance for your help! Any suggestions or optimization ideas are greatly appreciated.