How do I know if my code is running fast enough? Is there a measurable way to test the speed & performance of my code?
For example, I have a script that reads CSV files and writes new CSV files while using NumPy to calculate statistics. Below, I’m using cProfile on my Python script, but after seeing the resulting stats, what do I do next? In this case, I can see that the methods mean, astype and reduce from NumPy, the method writerow from csv, and the method append of Python lists take a significant portion of the time.
How can I know if my code can improve or not?
python -m cProfile -s cumulative OBSparser.py
176657699 function calls (176651606 primitive calls) in 528.419 seconds
Ordered by: cumulative time
  ncalls   tottime  percall   cumtime  percall  filename:lineno(function)
       1     0.003    0.003   528.421  528.421  OBSparser.py:1(<module>)
       1     0.000    0.000   526.874  526.874  OBSparser.py:45(start)
       1   165.767  165.767   526.874  526.874  OBSparser.py:48(parse)
 7638018     6.895    0.000   179.890    0.000  {method 'mean' of 'numpy.ndarray' objects}
 7638018    56.780    0.000   172.995    0.000  _methods.py:53(_mean)
 7628171    57.232    0.000    57.232    0.000  {method 'writerow' of '_csv.writer' objects}
 7700878    52.580    0.000    52.580    0.000  {method 'reduce' of 'numpy.ufunc' objects}
 7615219    50.640    0.000    50.640    0.000  {method 'astype' of 'numpy.ndarray' objects}
 7668436    28.595    0.000    36.853    0.000  _methods.py:43(_count_reduce_items)
15323753    31.503    0.000    31.503    0.000  {numpy.core.multiarray.array}
45751805    13.439    0.000    13.439    0.000  {method 'append' of 'list' objects}
Can somebody explain the best practices?
How do I know if my code is running fast enough?
That very much depends on your use case: your program runs for 1.4 hours, which might or might not be fast enough. If this is a one-time process, 1.4 hours is not that much, and spending any time on optimization is hardly worth the investment. On the other hand, if this is a process that should run, say, once every hour, it is clearly worth finding a less time-consuming approach.
Is there a measurable way to test the speed & performance of my code?
Yes: profiling, and you’ve already done that. That’s a good start.
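Alongside the profiler, it helps to record one reproducible baseline number you can compare against after every change. A minimal sketch, where process() is a hypothetical stand-in for your real workload:

```python
import time

def process(rows):
    # hypothetical stand-in for the real parse/compute work
    return [sum(r) / len(r) for r in rows]

def baseline(fn, data, repeats=3):
    # run the workload a few times and keep the best wall-clock time;
    # the minimum of several runs is the least noisy estimate of true cost
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best

rows = [list(range(1, 51))] * 10_000
print(f"baseline: {baseline(process, rows):.4f} s")
```

Re-run this after every optimization attempt so you can tell, in numbers, whether a change actually helped.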
what do I do next?
Best practices include:
1. measure baseline performance (before any optimization)
2. analyze the parts where the program spends most of its time
3. reduce run-time complexity (the Big-O kind)
4. check for the potential of parallel computation
5. compare against baseline performance
You have already done 1. So let’s move to 2.
Analysis
In your case the program spends most of its time in line OBSparser.py:48, of which a third is spent calculating the mean 7,638,018 times.
As the profiler output shows, this is on an ndarray, i.e. using numpy, and it doesn’t look like it’s taking a lot of time on a per-call basis. A quick calculation confirms that:
179.89 s / 7,638,018 calls ≈ 23.6 microseconds per call
Since that’s already implemented in C code (NumPy), there is likely not much you can do to improve the per-call performance by changing the actual mean code (or using another library).
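You can reproduce that per-call figure yourself with timeit; the exact numbers are machine-dependent, and the 50-element array below is only an assumption about the group size:

```python
import timeit
import numpy as np

a = np.random.rand(50)  # assumed group size, guessed from the profile

n = 20_000
per_call = timeit.timeit(a.mean, number=n) / n
print(f"{per_call * 1e6:.1f} microseconds per call (numpy)")

# compare against plain Python on the same data: for tiny arrays, the
# fixed dispatch overhead of the numpy call dominates the actual work
lst = a.tolist()
per_call_py = timeit.timeit(lambda: sum(lst) / len(lst), number=n) / n
print(f"{per_call_py * 1e6:.1f} microseconds per call (pure Python)")
```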
However, ask yourself several questions:
- How can the number of calls to .mean() be reduced?
- Can the calls to .mean() be implemented more efficiently?
- Could the data be grouped and each group be processed independently?
- ...ask more questions
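On the first question: if the groups happen to be equal-sized, all the per-group calls can be collapsed into a single reduction along an axis. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
groups = [rng.random(50) for _ in range(1000)]  # 1000 groups of 50 values

# per-group calls: 1000 separate invocations of .mean()
means_loop = np.array([g.mean() for g in groups])

# one call: stack the equal-sized groups into a 2-D array
# and reduce along axis 1, paying the dispatch overhead once
means_vec = np.stack(groups).mean(axis=1)

assert np.allclose(means_loop, means_vec)
```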
Other calls worth looking at are those to .astype() and reduce; I focused on .mean() simply for illustration.
Reducing complexity
Not knowing what your code actually does, here’s my five cents on the specifics anyway:
On 2., a quick check on my i7 core reveals that for ndarray.mean() to take 20-odd microseconds, the array needs to hold around 50 values. So I’m guessing you are grouping values and then calling .mean() on every group.
There might be more efficient ways – a search on numpy group aggregate performance or some variant of that might find you some helpful pointers.
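One such pointer, for the common case where each value carries an integer group id: np.bincount can aggregate all groups in two vectorized passes instead of one .mean() call per group. A sketch with made-up data:

```python
import numpy as np

# values labelled with an integer group id (hypothetical layout)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
group_ids = np.array([0, 0, 1, 1, 1, 2])

# per-group sum and count in two vectorized passes, then divide:
# equivalent to calling .mean() on each group, without per-call overhead
sums = np.bincount(group_ids, weights=values)
counts = np.bincount(group_ids)
group_means = sums / counts
print(group_means)  # group means: 1.5, 4.0, 6.0
```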
Parallel computation
On 3., I’m guessing multi-processing is unlikely to be a solution here, since your computations seem mostly CPU-bound and the overhead of launching separate tasks and exchanging data probably outweighs the benefits.
However there might be some use of SIMD-approach, i.e. vectorization. Again, just a hunch.
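That hunch is easy to test: NumPy ufuncs run their loops in C, where SIMD can apply, so replacing Python-level element loops with whole-array expressions is usually the first win. A sketch:

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# element by element in Python: one interpreter round-trip per value
def scale_loop(arr):
    out = np.empty_like(arr)
    for i, v in enumerate(arr):
        out[i] = v * 2.0 + 1.0
    return out

# one vectorized expression: the loop runs in C, where SIMD can apply
def scale_vec(arr):
    return arr * 2.0 + 1.0

assert np.allclose(scale_loop(x[:1000]), scale_vec(x[:1000]))
```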
Compare against baseline performance
To reduce the time it takes to re-profile, consider subsetting your data such that the performance behavior is still visible (i.e. ~23 µs per call to .mean()) but the total running time stays under maybe 1-2 minutes, or even less. This will help you evaluate several approaches before applying them to your program in full. There is no use in running the full process over and over again just to test some small optimization.
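The subsetting can be done without touching the full pipeline by profiling only a slice of the input; parse() and read_rows() below are hypothetical stand-ins for your parser and CSV reader:

```python
import cProfile
import pstats
from itertools import islice

def parse(rows):
    # hypothetical stand-in for the real OBSparser parse step
    return [sum(map(float, r)) for r in rows]

def read_rows():
    # hypothetical stand-in for the CSV reader; yields rows lazily
    for i in range(1_000_000):
        yield [i, i + 1, i + 2]

# profile only the first 10,000 rows instead of the whole file
profiler = cProfile.Profile()
profiler.enable()
parse(islice(read_rows(), 10_000))
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

Because the reader is a generator, islice stops it after 10,000 rows; the per-call timings stay representative while the total run is seconds, not hours.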
You have forgotten the most basic question:
Is the speed satisfactory for the use case?
- If the answer is “yes” -> don’t profile
- If no, you might look at your table.
But honestly, it does not look terribly useful, because almost all the time is spent in OBSparser.py:48(parse), which takes a LONG time. I would suggest you refactor that method into several smaller methods.
You might use a visualizer to inspect the results; PyCharm has good support for that use case.
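Even without a GUI, the profile can be dumped to a file and inspected with the standard pstats module; a self-contained sketch with a dummy work() function standing in for the real script:

```python
import cProfile
import pstats

def work():
    # dummy CPU-bound function standing in for the real script
    return sum(i * i for i in range(100_000))

# in-process equivalent of `python -m cProfile -o out.prof OBSparser.py`
profiler = cProfile.Profile()
profiler.runcall(work)
profiler.dump_stats("out.prof")

stats = pstats.Stats("out.prof")
stats.strip_dirs().sort_stats("tottime").print_stats(5)  # top 5 by own time
stats.print_stats("work")  # show only entries whose name matches "work"
```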
This is what non-functional requirements of performance are for.
The notion of fast enough is not technical per se. It depends on users’ perception of your product, and should be translated through the requirements. This is the only objective way for you to tell whether your actual implementation is fast enough or not.
If you don’t have those requirements, anything else is speculation and unconstructive.
- The user tells you that the app feels slow, but at no point does anybody specify what slow means in terms of milliseconds, on which hardware, and for which feature. Unconstructive: you can’t improve the code based on that, and you essentially can’t tell that a revision ago the code was unacceptably slow, and that now it’s fast enough.
- You think a specific feature can run faster than it currently does? That’s premature optimization, and it goes against your users, who may not care at all about the speed of this feature, and may prioritize a specific bug, need a new feature, or need something else to be faster.
How can I know if my code can improve or not?
Assume it always can. Some of the techniques include:
- Rewriting code to use more memory but less CPU, or more CPU but less memory. This often leads to code which is very difficult to read, understand and maintain; this is one of the reasons why premature optimization should be avoided.
- Using different data structures.
- Relying on caching, precomputing stuff, or using OLAP cubes.
- Moving low level, even down to assembler.
- Not doing the task. At all. That’s the ultimate optimization: from N seconds to zero.
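The caching point above can be sketched with functools.lru_cache from the standard library; stats_for() is a hypothetical expensive computation that gets run once per distinct input:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stats_for(key):
    # hypothetical expensive computation, done once per distinct key
    return sum(i * i for i in range(key))

stats_for(1000)   # computed
stats_for(1000)   # served from the cache, no recomputation
print(stats_for.cache_info())  # hits=1, misses=1 at this point
```

This trades memory for CPU, so it belongs to the first bullet as well: the cache grows with the number of distinct keys.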
As others have noted, don’t optimize unless the speed is unsatisfactory.
You’ve moved on to the next step which is to profile.
Once you’ve profiled its time to look for possible optimization candidates:
- Your process runs for 528 seconds in total.
- You have one call to OBSparser.py:48(parse) using 166 seconds of its own time. If you could totally eliminate that time, you would reduce the total time by only 31%.
- You have a number of calls to routines consuming between 50 and 60 seconds each. Eliminating the time spent in any one of those would save about 10% of the total.
I don’t see any place you can significantly improve performance. With a lot of work, you might be able to gain a 10 to 20% performance improvement. Unless there are strong reasons to improve performance, I would consider the optimization done.
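That ceiling can be checked with Amdahl’s-law arithmetic on the numbers from the profile above:

```python
def overall_speedup(fraction, local_speedup):
    # Amdahl's law: only `fraction` of the run time benefits
    # from a `local_speedup`-fold improvement
    return 1 / ((1 - fraction) + fraction / local_speedup)

total = 528.419  # total run time from the profile, in seconds

# even eliminating parse's own 166 s entirely leaves most of the time
print(f"{(1 - 166 / total) * 100:.0f}% of the run time remains")

# a 10x speedup on a routine taking ~57 s (about 11% of the run)
print(f"{overall_speedup(57 / total, 10):.2f}x overall speedup")
```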
I usually don’t find optimization very useful if I haven’t identified a routine using at least 80% of the time. A tenfold performance improvement on such a routine will cut the time to 30% or less.
If you do find such a routine, look for a better algorithm. If you don’t, don’t waste your time.
It’s not about testing, it’s about tuning.
Since you’re doing a lot of I/O, any sort of “CPU profiler” is not what you want.
The method I always use is manual sampling (random pausing).
Here’s what I would do if I were you: Tune the program until it is as fast as possible.
Then if it is not fast enough to be satisfactory, get faster hardware.
The way I would do it is take a number of samples manually.
Some of them will be in the process of doing I/O.
If they are mostly in I/O, then I would ask if there is any way to avoid some of that I/O.
(Don’t assume ahead of time that all I/O it’s doing is necessary. You may find that it’s doing something that could actually be avoided.)
If you can avoid some of the I/O, that will speed you up accordingly.
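One avoidable cost in the profile above is the 7.6 million Python-level writerow calls; csv.writer.writerows moves that loop into the csv module. A sketch writing to an in-memory buffer:

```python
import csv
import io

rows = [[i, i * 2] for i in range(5)]

buf = io.StringIO()
writer = csv.writer(buf)

# one writerow call per row: 7.6 million Python-level calls in the profile
# for row in rows:
#     writer.writerow(row)

# a single writerows call pushes the row loop into the csv module
writer.writerows(rows)
print(buf.getvalue())
```

For real files, also make sure output is buffered (the default for open()) rather than flushed per row.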
Now look at the samples landing in non-I/O processing.
Is it significant, like it takes more than 10% of the samples?
If so, is there any way to speed that up, by avoiding some of the work?
Each time you find something to improve, fix the program and run it all over again.
You may be pleasantly surprised that, since the last fix, some new thing shows up to fix, that you didn’t see before, but now it’s important.
When you can’t find anything more to fix, you can declare the program “as fast as you or probably anyone can make it”.
Then if it’s still not fast enough, your only option is faster CPU, solid-state disk drive, or whatever.