When writing performance-critical code (not necessarily large, but code that gets executed a lot), what topics should I be aware of or take into account?
I’m already fairly familiar with caching, aliasing, branching, cost of arithmetic operations, loop unrolling (and I guess some others that don’t come to mind).
What else should I know? (not necessarily looking for details, just concepts I can look up on my own)
My 2 cents:
It’s about how effectively you manage your resources and utilize the available hardware.
Generally, you will be limited by the amount of memory you have (RAM), while there is usually far more processing speed available than we actually utilize.
Effective utilization of memory depends largely on the techniques and algorithms you already mention (caching, aliasing, branching, cost of arithmetic operations, loop unrolling, and so on). On platforms where memory is managed (Java, .NET), a frequently running GC takes a toll on performance: the less free memory there is, the more often the GC runs trying to reclaim some, so keep a tab on how your objects are created and destroyed. In languages where memory is explicitly managed (C, C++), the rule of thumb is to keep a check yourself and free the memory as soon as you finish the task. In both cases, having enough free memory keeps the toll on performance small.
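As a minimal C++ sketch of that idea (hypothetical names): reuse one buffer across iterations of a hot loop instead of allocating a fresh one each time round, which is the same habit that keeps a GC quiet in managed languages.

```cpp
#include <string>
#include <vector>

// Sketch: one scratch buffer allocated once and reused every iteration,
// instead of constructing and destroying a new string per line.
void process_lines(const std::vector<std::string>& lines) {
    std::string scratch;                  // allocated once, reused below
    for (const auto& line : lines) {
        scratch.clear();                  // keeps its existing capacity
        scratch.append(line);
        scratch.append("\n");
        // ... do work with scratch ...
    }
}
```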
Concurrency is the way to utilize the CPU effectively. Instead of blocking and leaving the CPU idle, concurrency lets you keep it busy with tasks that can run in parallel. And while concurrency is one step above single-threaded programs, the advent and ubiquity of multi-core machines pushes us to take one more step (see here) and make effective use of all the cores. Utilizing the CPU effectively improves performance.
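As a rough illustration (a sketch using std::async; the function name is made up): split a reduction across the available hardware threads instead of leaving all but one core idle.

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Sketch: sum a large vector by farming out chunks to worker threads.
double parallel_sum(const std::vector<double>& data) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (data.size() + n - 1) / n;

    std::vector<std::future<double>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        std::size_t end = std::min(begin + chunk, data.size());
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        }));
    }

    double total = 0.0;
    for (auto& f : parts) total += f.get();   // join the workers
    return total;
}
```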
I think when it comes to performance, being familiar with the compiler's optimization strategies and methods is essential. Once you understand what kinds of optimizations the compiler can perform and under which circumstances, you can offload the performance tuning to the compiler (as you should anyway, if possible) by making sure you give it the maximum possible opportunity to optimize the code.
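For instance (a sketch, assuming gcc/clang and the non-standard but widely supported `__restrict` extension): a loop written in a plain index form, with aliasing ruled out, gives the optimizer a realistic chance to auto-vectorize it at -O2/-O3.

```cpp
#include <cstddef>

// Sketch: telling the compiler that `out` and `in` don't overlap
// (__restrict is a common compiler extension, not standard C++) removes
// the aliasing hazard that would otherwise block vectorization.
void scale(float* __restrict out, const float* __restrict in,
           float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}
```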
Of course, being familiar with tools that assist you in profiling and benchmarking your code/implementations is also essential. In fact, there's no point in optimizing unless you can measure the benefit. In the end this boils down to being able to compare multiple implementations and reason about what differs in their performance, and why, at the lowest level possible; without this understanding it is very hard to come up with valid optimization opportunities you can reason about.
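Even something as small as the following sketch (a hypothetical helper built on std::chrono only) makes "did it actually help?" answerable; a real benchmark would repeat runs, warm caches, and guard against the compiler eliding the measured work.

```cpp
#include <chrono>
#include <cstdio>

// Sketch: time a callable and report milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    volatile double sink = 0;             // keep the work observable
    double ms = time_ms([&] {
        for (int i = 0; i < 10000000; ++i) sink = sink + i * 0.5;
    });
    std::printf("took %.2f ms\n", ms);
}
```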
As the bottom line, I would argue that being familiar with the compiler and its optimizing capabilities and opportunities, being familiar with tools to measure benefits and spot bottlenecks, and understanding the CPU's capabilities and instruction set are the three most important things for any programmer when it comes to writing performant native code. Yes, this is very general, but it is also very applicable: everything boils down to this in the end.
First of all, most performance problems end up being caused by using, possibly by mistake, an algorithm with unfavourable complexity. So first and foremost, keep a lookout for anything that is done more times than expected.
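A typical instance of that (a hypothetical sketch): a linear scan hidden inside a loop is accidentally O(n·m), while building a hash set once makes each lookup O(1) on average.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Sketch: count how many queries appear in data. A naive version would scan
// the whole vector per query; the set is built once instead.
std::size_t count_hits(const std::vector<int>& data,
                       const std::vector<int>& queries) {
    std::unordered_set<int> lookup(data.begin(), data.end());  // O(n), once
    std::size_t hits = 0;
    for (int q : queries)
        if (lookup.count(q)) ++hits;      // O(1) average per query
    return hits;
}
```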
When you’ve flushed those out and are really down to micro-optimization as your last resort, tuning for memory-cache utilization can make some difference. It can be profiled with Valgrind’s cachegrind tool. The general suggestions are to keep the working set (the frequently accessed memory) small, keep related objects together, and possibly employ an allocator that can do cache colouring.
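One common way this plays out (a sketch with made-up names): if a hot loop only touches one field, a structure-of-arrays layout keeps the useful values densely packed in cache instead of dragging the unused fields along with every cache line. Cachegrind (run as `valgrind --tool=cachegrind ./your_program`) will show the difference in miss counts.

```cpp
#include <vector>

struct ParticleAoS { float x, y, z, mass; bool alive; };   // array of structs

struct ParticlesSoA {                                       // struct of arrays
    std::vector<float> x, y, z, mass;
};

float sum_x_aos(const std::vector<ParticleAoS>& ps) {
    float s = 0.0f;
    for (const auto& p : ps) s += p.x;   // each cache line also carries y, z, mass...
    return s;
}

float sum_x_soa(const ParticlesSoA& ps) {
    float s = 0.0f;
    for (float v : ps.x) s += v;         // contiguous floats: far fewer misses
    return s;
}
```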
Also, despite the research that has gone into allocators, standard libraries still often provide rather poor memory allocators, so consider using a custom one (umem, tcmalloc, …) and/or employing memory pools for common data structures (the improvements to the allocator requirements in C++11 make that much easier and more usable).
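For illustration only (a hypothetical, not production-ready sketch of a C++11-style stateful allocator): a bump/arena allocator that hands out memory from one pre-reserved buffer and releases everything at once when the arena goes away.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Sketch: bump allocator over a fixed arena. No thread safety, and freed
// blocks are never reused; everything is released with the arena.
template <typename T>
struct ArenaAllocator {
    using value_type = T;

    char* arena;
    std::size_t capacity;
    std::size_t* offset;   // shared bump pointer

    ArenaAllocator(char* buf, std::size_t cap, std::size_t* off)
        : arena(buf), capacity(cap), offset(off) {}

    template <typename U>
    ArenaAllocator(const ArenaAllocator<U>& other)
        : arena(other.arena), capacity(other.capacity), offset(other.offset) {}

    T* allocate(std::size_t n) {
        std::size_t start = (*offset + alignof(T) - 1) & ~(alignof(T) - 1);
        std::size_t bytes = n * sizeof(T);
        if (start + bytes > capacity) throw std::bad_alloc();
        *offset = start + bytes;
        return reinterpret_cast<T*>(arena + start);
    }
    void deallocate(T*, std::size_t) noexcept {}   // freed en masse with the arena
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return !(a == b);
}

int main() {
    alignas(std::max_align_t) static char buffer[1 << 16];
    std::size_t offset = 0;
    ArenaAllocator<int> alloc(buffer, sizeof(buffer), &offset);

    std::vector<int, ArenaAllocator<int>> v(alloc);
    for (int i = 0; i < 1000; ++i)
        v.push_back(i);                  // growth allocations come from the arena
}
```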
If you need to work with numbers and 100% accurate results are not needed, don’t be shy about trying out faster, less accurate algorithms.
This is heavily apparent in game development, where the results on screen don’t need to be exact; they just need to be close enough to look right, and it’s more important that the calculations happen fast. See the fast inverse square root for a classic example.
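For reference, that trick looks roughly like this (the magic constant is the one popularized by the Quake III source; the sketch uses memcpy instead of the original pointer cast to avoid undefined type-punning). It trades a small amount of accuracy for a much cheaper 1/sqrt(x).

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the classic fast inverse square root: roughly 0.2% error after
// one Newton-Raphson refinement step.
float fast_rsqrt(float x) {
    float half = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof(i));       // reinterpret the float's bits
    i = 0x5f3759df - (i >> 1);            // magic constant gives a rough guess
    float y;
    std::memcpy(&y, &i, sizeof(y));
    y = y * (1.5f - half * y * y);        // one Newton-Raphson step
    return y;
}
```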