I’m currently working on a project with a partner where we analyze large datasets of past sporting events. There are approximately 30,000 events per year and we have historical data for five years. Thus, we’re looking at analyzing 150,000 events.
I wrote our first version in Python with a heavy reliance on Pandas (great tool, by the way). However, each time I add a “factor” or calculation to the process, it slows down by quite a bit. Simply put, it can read all the various data files and run the data queries we need at a rate of 25 events per second.
The problem is that once I start working with the data, the rate drops fast. If I add a simple calculation, it drops to 24… and with each calculation after that, it drops again.
With just 10–12 “factors” calculated from the data (we’re not talking complex loops or anything crazy), it’s now down to 6 events per second… and we’re barely scratching the surface. At this rate, it’ll take days to work through all 150k events!
I’ve optimized our data queries, flat-file reads, etc. Those really aren’t the issue. I can live with 15–20 events per second. However, I can’t live with such drastic slowdowns each time a new factor calculation is added.
I’ve read a lot about how Python, as an interpreted language, is slow, etc., but my question to the experts out there is: what language should this project be done in?
EXAMPLE…
One of the key DataFrames I am using through Pandas is fairly large: 350 rows x 70 columns, for example. Even when trying to simply multiply one column’s value by another…
data['value'] = data['col1'] * data['col2']
…I see what I consider a significant drop in performance. It’s very puzzling and very frustrating. I understand this is a large set of data, but I can’t believe it’s anything so crazy that Python would slow to a crawl.
If I just read the data files and do no calculations on them at all, it reads 67 events in 2.807 seconds. If I add one simple calculation like the example above, it slows to 2.877 seconds. Based on our research, we’ll need to add upwards of 100 calculations on the data… so roughly 7 seconds of slowdown just for those 67 events? It seems too hard to believe.
I work on a platform that uses Python for testing and executing trading algorithms against market data. (The platform in question is Quantopian, https://quantopian.com.) For equity pricing alone there are roughly 98,000 events per equity per year, so when dealing with an individual stock we have a similar profile for our data sets.
We also use pandas for the majority of our data structures, and though it is tough to really compare without also knowing the machines used and the number of calculations done per event, we are able to process a few orders of magnitude more events per second; so there may be no reason to give up on pandas yet. (The backtester used by our site is open-sourced at https://github.com/quantopian/zipline, if you’d like to see how we use pandas throughout; of course, we still have bottlenecks to hammer out, too!)
From what I can see of how you are using your data set, this tip may help: if possible, avoid writing your calculations back into your DataFrame. A corollary is, if possible, avoid indexing in inner loops. While powerful and expressive, indexing into a DataFrame carries a performance penalty, much like a lookup into a Python dictionary does; writing the data back into the DataFrame is more expensive still. So if your program can just use the Series returned by multiplying the two columns, you may see some benefit from assigning that Series to its own variable instead of putting it back into the DataFrame.
To illustrate, below are some excerpts from an IPython notebook that, step by step, move the pandas DataFrame values closer and closer to a raw np.ndarray. The following is not intended as a cookbook, but to show where both writing and indexing into the DataFrame can hurt performance. Also, I wouldn’t be surprised if there is something in the pandas API that I am unaware of which makes assignment of the column much more performant; that would invalidate my advice.
In [1]:
# Imports needed to run these snippets outside of an IPython pylab session.
import pandas as pd
from numpy.random import randn

# Create a DataFrame with 3 columns and 350 random values each.
data = pd.DataFrame(randn(350, 3), columns=['col1', 'col2', 'value'])
In [2]:
# Time original example
%timeit data['value'] = data['col1'] * data['col2']
10000 loops, best of 3: 170 µs per loop
In [3]:
# Time without writing the example back into the DataFrame
# The value variable will be a pd.Series
%timeit value = data['col1'] * data['col2']
10000 loops, best of 3: 43.9 µs per loop
In [4]:
# Extract the pd.Series before the inner loop
col1 = data['col1']
col2 = data['col2']
In [5]:
# Time with the Series extracted ahead of time
# The value variable will be a pd.Series
%timeit value = col1 * col2
10000 loops, best of 3: 32.3 µs per loop
In [6]:
# Time using the underlying numpy arrays.
# The value variable will be a np.ndarray
%timeit value = data['col1'].values * data['col2'].values
100000 loops, best of 3: 15.9 µs per loop
In [7]:
# Extract the numpy arrays before the inner loop
col1_v = data['col1'].values
col2_v = data['col2'].values
In [8]:
# Time just the multiplication of two np.ndarrays
%timeit col1_v * col2_v
1000000 loops, best of 3: 1.5 µs per loop
Python is plenty fast for this purpose. Depending on what kind of analysis you’re running, I would expect it to easily handle hundreds or even thousands of events per second.
Your slowdown is most likely algorithmic. It’s not enough to optimize pieces of the puzzle; you have to optimize the system as a whole by choosing appropriate data structures and algorithms.
I once encountered some code that ran very slowly, even though no single layer looked like it should. It appeared to be a simple linked-list traversal with at most 64 nodes, which is O(n) with a pretty small n. It turned out that function called another O(n) function, which called another O(n) function, and so on, six layers deep. That resulted in an O(n^6) algorithm that no one had noticed, because they only looked at one layer at a time and ran it on data that happened to short-circuit fortuitously.
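To make the trap concrete, here is a contrived Python sketch (not the actual code from that anecdote) in which every function looks linear in isolation, yet the composition is cubic:

def total(items):
    # O(n): one pass over the data
    return sum(items)

def row_sums(items):
    # Looks O(n), but each iteration calls an O(n) function -> O(n^2)
    return [total(items) for _ in items]

def grid_sums(items):
    # Looks O(n), but each iteration calls an O(n^2) function -> O(n^3)
    return [row_sums(items) for _ in items]

# 64 items looks tiny, yet grid_sums already performs 64**3 = 262,144 additions.
result = grid_sums(list(range(64)))

Each layer reads innocently on its own; only tracing the whole call stack reveals the blow-up.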
So look at your system as a whole. Are you reading the same files over and over again or caching appropriately? Are you performing the same joins or other calculations over and over again or preprocessing and memoizing your data for future use? Are you assuming lower layers of function calls are linear or constant time when they’re not? Can you rearrange the order of operations to make it faster? Can you create indexes to speed up certain lookups?
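For example, caching repeated file reads is often a one-decorator fix in Python. Below is a minimal sketch assuming the events live in per-year CSV files; the paths and the load_events helper are hypothetical, not from the question:

import functools

import pandas as pd

@functools.lru_cache(maxsize=None)
def load_events(path):
    # Read an event file once; subsequent calls with the same
    # path return the cached DataFrame instead of hitting disk.
    return pd.read_csv(path)

df_a = load_events('events_2013.csv')  # reads from disk
df_b = load_events('events_2013.csv')  # served from the cache

One caveat: the cached object is shared between callers, so treat the returned DataFrame as read-only or copy it before mutating.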
These are the sorts of issues you need to fix before you start looking at the linear speedups you might gain from switching to another language. If it’s difficult to fix in Python, imagine how much more difficult it will be in an unfamiliar programming language.
You should use the language you are most comfortable with and that allows you to manipulate the data in the manner you’d like. Python has a number of scientific libraries that allow you to do just that.
Regarding performance issues: you won’t know until you get there. And for as small a data set as you’re talking about, you can always throw more hardware at it. And yes, 150k events is small.
You should also look into static-analysis and performance-analysis tools to identify what you could optimize within your code. If you’re new to Python, you’re likely making mistakes (don’t feel bad, we’ve all been there), so there’s likely plenty of room for optimization. Switching to another language won’t resolve those issues. In other words, don’t blame the tool until you’re sure you have removed any operator error.
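As a concrete starting point, Python’s built-in cProfile module will show where the time actually goes before you rewrite anything. A minimal sketch, where process_events is a hypothetical stand-in for your real per-event pipeline:

import cProfile
import pstats

def process_events():
    # Stand-in for the real factor calculations.
    total = 0
    for i in range(100000):
        total += i * i
    return total

cProfile.run('process_events()', 'profile.out')
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)  # ten hottest call sites

Functions that dominate cumulative time are where optimization (or caching) pays off; everything else is noise.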