I’m currently working on a project with a partner where we analyze large datasets of past sporting events. There are approximately 30,000 events per year and we have historical data for five years. Thus, we’re looking at analyzing 150,000 events.
I wrote our first version in Python with a heavy reliance on Pandas (great tool, by the way). However, each time I add a “factor” or calculation to the process, it slows down by quite a bit. Simply put, it can read all the various data files and run the data queries we need at a rate of 25 events per second.
The problem is that once I start working with the data, the rate drops fast. If I add a simple calculation, it drops to 24… and with each calculation after that, it drops again.
With just 10–12 “factors” calculated from the data (we’re not talking complex loops or anything crazy), it’s now down to 6 events per second… and we’re barely scratching the surface. At this rate, it’ll take days to work through all 150k events!
I’ve optimized our data queries, flat-file reads, etc. Those really aren’t the issue. I can live with 15–20 events per second. However, I can’t live with such drastic slowdowns each time a new factor calculation is added.
I’ve read a lot about how Python, as an interpreted language, is slow, etc., but my question to the experts out there is: what language should this project be done in?
EXAMPLE…
One of the key DataFrames I am using through Pandas is fairly large: 350 rows x 70 columns, for example. Even when trying to simply multiply one column’s value by another…
data['value'] = data['col1'] * data['col2']
…I see what I consider a significant drop in performance. It’s very puzzling and very frustrating. I understand this is a large set of data, but I can’t believe it’s anything so crazy that Python would slow to a crawl.
If I just read the data files and do no calculations on them at all, it reads 67 events in 2.807 seconds. If I add one simple calculation like the example above, it slows to 2.877 seconds. Based on our research, we’ll need to add upwards of 100 calculations on the data… so roughly 7 seconds of slowdown just for those 67 events? It seems too hard to believe.
I work on a platform that uses Python for testing and executing trading algorithms against market data. (The platform in question is Quantopian, https://quantopian.com.) For equity pricing alone there are roughly 98,000 events per equity per year, so when dealing with an individual stock we have a similar profile for our data sets.
We also use pandas for the majority of our data structures, and though it is tough to really compare without also knowing the machines used and the number of calculations done per event, we are able to process a few orders of magnitude more events per second; so there may be no reason to give up on pandas yet. (The backtester used by our site is open-sourced at https://github.com/quantopian/zipline, if you’d like to see how we use pandas throughout; of course, we still have bottlenecks to hammer out, too!)
From what I can see of how you are using your data set, this tip may help: if possible, avoid writing your calculations back into your DataFrame. A corollary is, if possible, avoid indexing in inner loops. While powerful and expressive, indexing into a DataFrame carries a performance penalty, much like a lookup into a Python dictionary does; writing the data back into the DataFrame is more expensive still. So if your program can just use the Series returned by multiplying the two columns, you may see some benefit from assigning that Series to its own variable instead of putting it back into the DataFrame.
To illustrate, below are some excerpts from an IPython notebook that, step by step, move the pandas DataFrame values closer and closer to a raw np.ndarray. The following is not intended as a cookbook, but to show where both writing and indexing into the DataFrame can hurt performance. Also, I wouldn’t be surprised if there is something in the pandas API that I am unaware of which makes assignment of the column much more performant; that would invalidate my advice.
In [1]:
# Imports needed to run these snippets outside of an IPython pylab session.
import pandas as pd
from numpy.random import randn

# Create a DataFrame with 3 columns and 350 random values each.
data = pd.DataFrame(randn(350, 3), columns=['col1', 'col2', 'value'])
In [2]:
# Time original example
%timeit data['value'] = data['col1'] * data['col2']
10000 loops, best of 3: 170 µs per loop
In [3]:
# Time without writing the example back into the DataFrame
# The value variable will be a pd.Series
%timeit value = data['col1'] * data['col2']
10000 loops, best of 3: 43.9 µs per loop
In [4]:
# Extract the pd.Series before the inner loop
col1 = data['col1']
col2 = data['col2']
In [5]:
# Time with the Series extracted ahead of time
# The value variable will be a pd.Series
%timeit value = col1 * col2
10000 loops, best of 3: 32.3 µs per loop
In [6]:
# Time using the underlying numpy arrays.
# The value variable will be a np.ndarray
%timeit value = data['col1'].values * data['col2'].values
100000 loops, best of 3: 15.9 µs per loop
In [7]:
# Extract the numpy arrays before the inner loop
col1_v = data['col1'].values
col2_v = data['col2'].values
In [8]:
# Time just the multiplication of two np.ndarrays
%timeit col1_v * col2_v
1000000 loops, best of 3: 1.5 µs per loop
Python is plenty fast for this purpose. Depending on what kind of analysis you’re running, I would expect it to easily handle hundreds or even thousands of events per second.
Your slowdown is most likely algorithmic. It’s not enough to optimize pieces of the puzzle; you have to optimize the system as a whole by choosing appropriate data structures and algorithms.
I once encountered some code that ran very slowly, even though no single layer looked like it should. It appeared to be a simple linked-list traversal with at most 64 nodes, which is O(n) with a pretty small n. It turned out that function called another O(n) function, which called another O(n) function, and so on, six layers deep. That resulted in an O(n^6) algorithm that no one had noticed, because they only looked at one layer at a time and ran it on data that happened to short-circuit fortuitously.
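To make the trap concrete, here is a contrived Python sketch (not the actual code from that anecdote) in which every function looks linear in isolation, yet the composition is cubic:

def total(items):
    # O(n): one pass over the data
    return sum(items)

def row_sums(items):
    # Looks O(n), but each iteration calls an O(n) function -> O(n^2)
    return [total(items) for _ in items]

def grid_sums(items):
    # Looks O(n), but each iteration calls an O(n^2) function -> O(n^3)
    return [row_sums(items) for _ in items]

# 64 items looks tiny, yet grid_sums already performs 64**3 = 262,144 additions.
result = grid_sums(list(range(64)))

Each layer reads innocently on its own; only tracing the whole call stack reveals the blow-up.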
So look at your system as a whole. Are you reading the same files over and over again or caching appropriately? Are you performing the same joins or other calculations over and over again or preprocessing and memoizing your data for future use? Are you assuming lower layers of function calls are linear or constant time when they’re not? Can you rearrange the order of operations to make it faster? Can you create indexes to speed up certain lookups?
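For example, caching repeated file reads is often a one-decorator fix in Python. Below is a minimal sketch assuming the events live in per-year CSV files; the paths and the load_events helper are hypothetical, not from the question:

import functools

import pandas as pd

@functools.lru_cache(maxsize=None)
def load_events(path):
    # Read an event file once; subsequent calls with the same
    # path return the cached DataFrame instead of hitting disk.
    return pd.read_csv(path)

df_a = load_events('events_2013.csv')  # reads from disk
df_b = load_events('events_2013.csv')  # served from the cache

One caveat: the cached object is shared between callers, so treat the returned DataFrame as read-only or copy it before mutating.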
These are the sorts of issues you need to fix before you start looking at the linear speedups you might gain from switching to another language. If it’s difficult to fix in Python, imagine how much more difficult it will be in an unfamiliar programming language.
You should use the language you are most comfortable with and that allows you to manipulate the data in the manner you’d like. Python has a number of scientific libraries that allow you to do just that.
Regarding performance issues: you won’t know until you get there. And for as small a data set as you’re talking about, you can always throw more hardware at it. And yes, 150k events is small.
You should also look into static-analysis and performance-analysis tools to identify what you could optimize within your code. If you’re new to Python, you’re likely making mistakes (don’t feel bad, we’ve all been there), so there’s likely plenty of room for optimization. Switching to another language won’t resolve those issues. In other words, don’t blame the tool until you’re sure you have removed any operator error.
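As a concrete starting point, Python’s built-in cProfile module will show where the time actually goes before you rewrite anything. A minimal sketch, where process_events is a hypothetical stand-in for your real per-event pipeline:

import cProfile
import pstats

def process_events():
    # Stand-in for the real factor calculations.
    total = 0
    for i in range(100000):
        total += i * i
    return total

cProfile.run('process_events()', 'profile.out')
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)  # ten hottest call sites

Functions that dominate cumulative time are where optimization (or caching) pays off; everything else is noise.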