Does the “magic” of the JVM hinder the influence a programmer has over micro-optimisations in Java? I recently read that in C++ the ordering of data members can sometimes provide optimizations (granted, in the microsecond environment), and I presume a programmer’s hands are tied when it comes to squeezing performance out of Java?
I appreciate that a decent algorithm provides greater speed gains, but once you have the correct algorithm, is Java harder to tweak due to the JVM’s control?
If not, could people give examples of what tricks you can use in Java (besides simple compiler flags)?
Sure, at the micro-optimization level the JVM will do some things that you’ll have little control over, especially compared to C and C++.
On the other hand, the variety of compiler behaviors in C and C++ will have a far greater negative impact on your ability to do micro-optimizations in any vaguely portable way (even across compiler revisions).
It depends on what sort of project you’re tweaking, what environments you’re targeting, and so on. And in the end, it doesn’t really matter, since you’re getting results a few orders of magnitude better from algorithmic, data-structure, and program-design optimizations anyway.
4
Micro-optimizations are almost never worth the time, and almost all the easy ones are done automatically by compilers and runtimes.
There is, however, one important area of optimization where C++ and Java are fundamentally different, and that is bulk memory access. C++ has manual memory management, which means you can optimize the application’s data layout and access patterns to make full use of caches. This is quite hard, somewhat specific to the hardware you’re running on (so performance gains may disappear on different hardware), but if done right, it can lead to absolutely breathtaking performance. Of course you pay for it with the potential for all kinds of horrible bugs.
With a garbage-collected language like Java, this kind of optimization cannot be done in code. Some of it can be done by the runtime (automatically or through configuration, see below), and some of it is simply not possible (the price you pay for being protected from memory-management bugs).
If not, could people give examples of what tricks you can use in Java (besides simple compiler flags)?
Compiler flags are irrelevant in Java because the Java compiler does almost no optimization; the runtime does.
And indeed Java runtimes have a multitude of parameters that can be tweaked, especially concerning the garbage collector. There’s nothing “simple” about those options – the defaults are good for most applications, and getting better performance requires you to understand exactly what the options do and how your application behaves.
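To give a flavor, here is an illustrative HotSpot command line (illustrative only, not a recommendation; flag availability varies by JVM version, and -Xlog requires JDK 9+):

# Illustrative only: fixed heap size, G1 collector, a pause-time goal
# (a hint, not a guarantee), and GC logging to verify the effect.
java -Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -Xlog:gc MyApp

None of these values are “right” in general; the point is that this tuning happens at the runtime, not in the source code.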
[…] (granted, in the microsecond environment) […]
Microseconds add up if we’re looping over millions to billions of things. A personal VTune/micro-optimization session from C++ (no algorithmic improvements):
T-Rex (12.3 million facets):
Initial Time: 32.2372797 seconds
Multithreading: 7.4896073 seconds
4.9201039 seconds
4.6946372 seconds
3.261677 seconds
2.6988536 seconds
SIMD: 1.7831 seconds
4-valence patch optimization: 1.25007 seconds
0.978046 seconds
0.970057 seconds
0.911041 seconds
Everything besides “multithreading”, “SIMD” (handwritten to beat the compiler), and the 4-valence patch optimization was a micro-level memory optimization. Also, the code starting from the initial time of 32 seconds was already optimized quite a bit (theoretically optimal algorithmic complexity), and this is a recent session. The original version, long before this recent session, took over 5 minutes to process.
Optimizing memory efficiency can often help by anywhere from several times to orders of magnitude in a single-threaded context, and more in multithreaded contexts (the benefits of an efficient memory representation often multiply with multiple threads in the mix).
On the Importance of Micro-Optimization
I get a little agitated by the idea that micro-optimizations are a waste of time. I agree that it’s good general advice, but not everyone does them incorrectly, based on hunches and superstitions rather than measurements. Done correctly, they don’t necessarily yield a micro impact. If we take Intel’s own Embree (a raytracing kernel), test only the simple scalar BVH they’ve written (not the ray-packet version, which is exponentially harder to beat), and then try to beat the performance of that data structure, it can be a most humbling experience even for a veteran used to profiling and tuning code for decades. And it’s all because of the micro-optimizations applied. Their solution can process over a hundred million rays per second, when I’ve seen industrial professionals working in raytracing who can’t even get 3 million rays per second on the same hardware (same bounding volume hierarchy, same logarithmic search).
There’s no way to take a straightforward implementation of a BVH with only an algorithmic focus and get over a hundred million primary ray intersections per second out of it with any optimizing compiler (even Intel’s own ICC). A straightforward one often doesn’t even get a million rays per second. It often takes professional-quality solutions just to get a few million rays per second. It takes Intel-level micro-optimization to get over a hundred million rays per second.
Algorithms
I think micro-optimization is unimportant as long as you don’t care about performance at the level of minutes versus seconds, or hours versus minutes. If we take a horrific algorithm like bubble sort over a massive input as an example, and compare it to even a basic implementation of merge sort, the former might take months to process, the latter maybe 12 minutes, as a result of quadratic vs. linearithmic complexity.
The difference between months and minutes is probably going to make most people, even those not working in performance-critical fields, consider the execution time unacceptable: nobody wants users waiting months for a result.
Meanwhile, if we compare the non-micro-optimized, straightforward merge sort to quicksort (which is not at all algorithmically superior to merge sort, and only offers micro-level improvements for locality of reference), the micro-optimized quicksort might finish in 15 seconds as opposed to 12 minutes. Making users wait 12 minutes might be perfectly acceptable (coffee break kind of time).
To most people, the difference between, say, 12 minutes and 15 seconds is probably negligible, and that’s why micro-optimization is often considered useless: it’s usually the difference between minutes and seconds, not minutes and months. The other reason I think it’s considered useless is that it’s often applied to areas that don’t matter: some little area that isn’t even loopy or critical, yielding a questionable 1% difference (which may very well just be noise). But for people who care about these kinds of time differences and are willing to measure and do it right, I think it’s worth paying attention to at least the basic concepts of the memory hierarchy (specifically the upper levels relating to page faults and cache misses).
Java Leaves Plenty of Room for Good Micro-Optimizations
Phew, sorry — with that sort of rant aside:
Does the “magic” of the JVM hinder the influence a programmer has over micro-optimisations in Java?
A little bit, but not as much as people might think if you do it right. For example, if you are doing image processing in native code with handwritten SIMD, multithreading, and memory optimizations (access patterns and possibly even representation, depending on the image processing algorithm), it’s easy to crunch hundreds of millions of pixels per second for 32-bit RGBA pixels (8-bit color channels), and sometimes even billions per second.
It’s impossible to get anywhere close in Java if you, say, made a Pixel object (that alone would inflate the size of a pixel from 4 bytes to 16 on 64-bit).
But you might be able to get a whole lot closer if you avoided the Pixel object, used an array of bytes, and modeled an Image object. Java’s still pretty competent there if you start using arrays of plain old data. I’ve tried these kinds of things before in Java and was quite impressed, provided that you don’t create a bunch of teeny little objects everywhere that are 4 times bigger than normal (ex: use int instead of Integer) and start modeling bulk interfaces like an Image interface, not a Pixel interface. I’d even venture to say that Java can rival C++ performance if you are looping over plain old data and not objects (huge arrays of float, e.g., not Float).
Perhaps even more important than the memory sizes is that an array of int guarantees a contiguous representation. An array of Integer does not. Contiguity is often essential for locality of reference, since it means multiple elements (ex: 16 ints) can all fit into a single cache line and potentially be accessed together prior to eviction with efficient memory access patterns. Meanwhile, a single Integer might be stranded somewhere in memory with the surrounding memory being irrelevant, only to have that region of memory loaded into a cache line just to use a single integer prior to eviction, as opposed to 16 integers. Even if we got marvelously lucky and the surrounding Integers were all right next to each other in memory, we can only fit 4 into a cache line that can be accessed prior to eviction, as a result of Integer being 4 times larger, and that’s in the best-case scenario.
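To make that concrete, here is a minimal sketch of the two representations (the class names are mine, and the exact per-object sizes depend on the JVM):

// One object per pixel: object header plus fields per pixel on a typical
// 64-bit JVM, plus one reference per element, with no guarantee that the
// Pixel objects themselves are contiguous.
class Pixel {
    byte r, g, b, a;
}

class SlowImage {
    Pixel[] pixels; // an array of references; the pixels may be scattered
}

// Bulk representation: one packed int per pixel, guaranteed contiguous,
// so 16 pixels share each 64-byte cache line.
class PackedImage {
    final int width, height;
    final int[] argb;

    PackedImage(int width, int height) {
        this.width = width;
        this.height = height;
        this.argb = new int[width * height];
    }

    int get(int x, int y) { return argb[y * width + x]; }
    void set(int x, int y, int pixel) { argb[y * width + x] = pixel; }
}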
And there are plenty of micro-optimizations to be had there, since we’re unified under the same memory architecture/hierarchy. Memory access patterns matter no matter what language you use; concepts like loop tiling/blocking might be applied far more often in C or C++, but they benefit Java just as much.
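For instance, here is a loop-tiling sketch in Java (the transpose is my own example, and the block size of 32 is an assumption to tune, not a prescription):

// Naive transpose: for large n, the write side strides a full row apart
// on every iteration and misses cache constantly.
static void transpose(float[] dst, float[] src, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

// Tiled transpose: process the matrix in 32x32 blocks so reads and
// writes stay within cache-resident tiles before moving on.
static void transposeTiled(float[] dst, float[] src, int n) {
    final int B = 32; // tile size; tune for your cache
    for (int ib = 0; ib < n; ib += B)
        for (int jb = 0; jb < n; jb += B)
            for (int i = ib; i < Math.min(ib + B, n); i++)
                for (int j = jb; j < Math.min(jb + B, n); j++)
                    dst[j * n + i] = src[i * n + j];
}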
I recently read in C++ sometimes the ordering of the data members can provide optimizations […]
The order of data members generally doesn’t matter in Java, but that’s mostly a good thing. In C and C++, preserving the order of data members is often important for ABI reasons, so compilers don’t mess with it. Human developers working there have to be careful to do things like arrange their data members in descending order (largest to smallest) to avoid wasting memory on padding. With Java, the JIT can apparently reorder the members for you on the fly to ensure proper alignment while minimizing padding. Provided that’s the case, it automates something that average C and C++ programmers often do poorly, wasting memory as a result (and that isn’t just wasted memory, but often wasted speed, by needlessly increasing the stride between AoS structures and causing more cache misses).
Rearranging fields to minimize padding is a very robotic thing, so ideally humans don’t deal with it. The only time field arrangement might matter in a way that requires a human to know the optimal arrangement is if the object is bigger than 64 bytes and we’re arranging fields based on access pattern (not optimal padding), in which case it might be a more human endeavor (it requires understanding critical paths, some of which is information a compiler can’t possibly anticipate without knowing what users will do with the software).
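If you want to check what layout the JVM actually chose, the OpenJDK JOL tool can print it. A small sketch (assumes the org.openjdk.jol:jol-core dependency; the Mixed class is just an example):

import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    static class Mixed {
        byte b;
        long l;
        int i;
        short s;
    }

    public static void main(String[] args) {
        // Prints field offsets, alignment, and any padding the VM inserted;
        // the order shown typically differs from the source order.
        System.out.println(ClassLayout.parseClass(Mixed.class).toPrintable());
    }
}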
If not, could people give examples of what tricks you can use in Java (besides simple compiler flags)?
The biggest difference to me, in terms of an optimizing mentality, between Java and C++ is that C++ might let you use objects a little (teeny) bit more than Java in a performance-critical scenario. For example, C++ can wrap an integer in a class with no overhead whatsoever (benchmarked all over the place). Java has to have that metadata pointer-style plus alignment-padding overhead per object, which is why a Boolean is bigger than a boolean (but in exchange it provides the uniform benefits of reflection and the ability to override any function not marked final for every single UDT).
It’s a little bit easier in C++ to control the contiguity of memory layouts across non-homogeneous fields (ex: interleaving floats and ints into one array through a struct/class), since spatial locality is often lost (or at least control over it is lost) in Java when allocating objects through the GC.
… but often the highest-performance solutions will split those up anyway and use an SoA access pattern over contiguous arrays of plain old data. So for the areas that need peak performance, the strategies for optimizing memory layout in Java and C++ are often the same, and will often have you demolishing those teeny object-oriented interfaces in favor of collection-style interfaces that can do things like hot/cold field splitting, SoA reps, etc. (see the sketch below). Non-homogeneous AoSoA reps seem kind of impossible in Java (unless you just use a raw array of bytes or something like that), but those are for rare cases where both sequential and random access patterns need to be fast while simultaneously having a mixture of field types for hot fields. To me, the bulk of the difference in optimization strategy (at that general level) between the two is moot if you are reaching for peak performance.
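A minimal SoA sketch of that idea (the ParticleSystem fields are illustrative, not from any particular codebase):

// AoS: one Particle object per particle, each with header overhead, and a
// loop over lifetimes drags the rest of the fields through the cache too.
class Particle {
    float x, y, z;
    float lifetime;
}

// SoA: one contiguous array per field; a pass over lifetimes streams
// through exactly the bytes it needs.
class ParticleSystem {
    final float[] x, y, z;
    final float[] lifetime;
    int count;

    ParticleSystem(int capacity) {
        x = new float[capacity];
        y = new float[capacity];
        z = new float[capacity];
        lifetime = new float[capacity];
    }

    void age(float dt) {
        for (int i = 0; i < count; i++)
            lifetime[i] -= dt; // the hot loop touches only one array
    }
}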
The differences vary quite a bit more if you are simply reaching for “good” performance. There, not being able to do as much with small objects like Integer vs. int can be a bit more of a PITA, especially with the way they interact with generics. It’s a bit harder to build one generic data structure as a central optimization target in Java that works for int, float, etc. while avoiding those bigger, expensive UDTs, but often the most performance-critical areas will require hand-rolling your own data structures tuned for a very specific purpose anyway, so it’s only annoying for code that is striving for good performance but not peak performance.
Object Overhead
Note that Java object overhead (metadata, loss of spatial locality, and temporary loss of temporal locality after an initial GC cycle) tends to be big for things that are really small (like an int vs. an Integer) and are stored by the millions in some largely contiguous data structure accessed in very tight loops. There seems to be a lot of sensitivity about this subject, so I should clarify that you don’t want to be worrying about object overhead for big objects like images, just really minuscule objects like a single pixel.
If anyone feels doubtful about this part, I’d suggest making a benchmark between summing up a million random ints vs. a million random Integers, and doing so repeatedly (the Integers will reshuffle in memory after an initial GC cycle).
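A rough sketch of such a benchmark (deliberately naive; for numbers you can trust, a harness like JMH handles warmup and dead-code elimination properly):

import java.util.Random;

public class BoxedSumBench {
    public static void main(String[] args) {
        final int N = 1_000_000;
        Random rng = new Random(42);

        int[] primitives = new int[N];
        Integer[] boxed = new Integer[N];
        for (int i = 0; i < N; i++) {
            int v = rng.nextInt();
            primitives[i] = v;
            boxed[i] = v; // autoboxing allocates for values outside the small cache
        }

        // Repeat so the JIT warms up and the GC gets a chance to move objects.
        for (int round = 0; round < 10; round++) {
            long t0 = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < N; i++) sum += primitives[i];

            long t1 = System.nanoTime();
            long sum2 = 0;
            for (int i = 0; i < N; i++) sum2 += boxed[i]; // unbox + pointer chase

            long t2 = System.nanoTime();
            System.out.printf("int[]: %d us, Integer[]: %d us (sums %d / %d)%n",
                    (t1 - t0) / 1000, (t2 - t1) / 1000, sum, sum2);
        }
    }
}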
Ultimate Trick: Interface Designs That Leave Room to Optimize
So the ultimate Java trick, as I see it, if you are dealing with a place that handles a heavy load over small objects (ex: a Pixel, a 4-vector, a 4×4 matrix, a Particle, possibly even an Account if it only has a few small fields), is to avoid using objects for these teeny things and to use arrays (possibly chained together) of plain old data. The objects then become collection interfaces like Image, ParticleSystem, Accounts, a collection of matrices or vectors, etc. Individual ones can be accessed by index, for example. This is also one of the ultimate design tricks in C and C++, since even without that basic object overhead and disjointed memory, modeling the interface at the level of a single particle prevents the most efficient solutions.
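To make the interface point concrete, here is a hypothetical Accounts collection (the names and fields are mine):

// Instead of one Account object per account, the collection owns parallel
// arrays of plain old data and hands out indices as handles.
class Accounts {
    private final long[] balanceCents;
    private final int[] ownerId;
    private int count;

    Accounts(int capacity) {
        balanceCents = new long[capacity];
        ownerId = new int[capacity];
    }

    int add(int owner, long cents) {
        ownerId[count] = owner;
        balanceCents[count] = cents;
        return count++; // the index is the "handle" to an account
    }

    // Bulk operations loop over contiguous data instead of chasing references.
    long totalBalance() {
        long total = 0;
        for (int i = 0; i < count; i++)
            total += balanceCents[i];
        return total;
    }
}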
There’s a middle area between micro-optimization, on the one hand, and good choice of algorithm, on the other.
It is the area of constant-factor speedups, and it can yield orders of magnitude.
The way it does so is by lopping off entire fractions of the execution time, like first 30%, then 20% of what’s left, then 50% of that, and so on for several iterations, until there’s hardly anything left.
You don’t see this in little demo-style programs. Where you see it is in big serious programs with lots of class data structures, where the call stack is typically many layers deep.
A good way to find the speedup opportunities is by examining random-time samples of the program’s state.
Generally the speedups consist of things like:

- minimizing calls to new by pooling and re-using old objects (a sketch follows this list),
- recognizing things being done that are in there for generality’s sake rather than actually being necessary,
- revising the data structure by using different collection classes that have the same big-O behavior but take advantage of the access patterns actually used,
- saving data that’s been acquired by function calls instead of re-calling the function (it is a natural and amusing tendency of programmers to assume that functions having shorter names execute faster),
- tolerating a certain amount of inconsistency between redundant data structures, as opposed to trying to keep them entirely consistent with notification events,
- etc., etc.
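For the first item, a minimal free-list pool sketch (my own illustration; whether pooling actually wins on a modern generational collector is exactly the kind of thing the samples should decide):

import java.util.ArrayDeque;
import java.util.function.Supplier;

// Reuses old objects instead of calling new in a hot path.
class Pool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    Pool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T obj = free.poll();
        return obj != null ? obj : factory.get();
    }

    void release(T obj) {
        free.push(obj); // the caller must reset the object's state first
    }
}

// Usage: Pool<StringBuilder> pool = new Pool<>(StringBuilder::new);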
But of course none of these things should be done without first being shown to be problems by taking samples.
Java (as far as I’m aware) gives you no control over variable locations in memory, so you have a harder time avoiding things like false sharing and alignment issues (though you can pad out a class with several unused members). Another thing I don’t think you can take advantage of is instructions such as pause (the _mm_pause() intrinsic in C/C++), but these things are CPU-specific, so if you figure you need them, Java may not be the language to use.
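A sketch of that padding trick (manual and fragile, since the JVM is free to reorder fields; later JDKs added an internal @Contended annotation for exactly this reason):

// Two counters updated by different threads suffer false sharing if they
// land on the same 64-byte cache line: each write invalidates the other
// core's copy of the line.
class PaddedCounter {
    volatile long value;
    // Unused fields intended to push any following hot field onto another
    // cache line. Best effort only; the JVM may rearrange these.
    long p1, p2, p3, p4, p5, p6, p7;
}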
There exists the Unsafe class, which gives you the flexibility of C/C++ but also the danger of C/C++.
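For example, off-heap allocation through sun.misc.Unsafe looks roughly like this (a sketch; the class is internal, unsupported, and increasingly restricted in recent JDKs):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // Unsafe has no public constructor; grab the singleton via reflection.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long addr = unsafe.allocateMemory(1024); // raw memory, invisible to the GC
        unsafe.putInt(addr, 42);                 // no bounds or type checks
        System.out.println(unsafe.getInt(addr));
        unsafe.freeMemory(addr);                 // forget this and you leak
    }
}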
It might also help you to look at the assembly code the JVM generates for your code.
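On HotSpot that is possible with diagnostic flags (these require the hsdis disassembler plugin to be installed):

# Dump JIT-generated assembly for everything, or for one method:
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyApp
java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,MyApp::hotMethod MyApp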
To read about a Java app that looks at this kind of detail, see the Disruptor code released by LMAX.
This question is very hard to answer, because it depends on language implementations.
In general there’s very little room for such “micro-optimizations” these days. The main reason is that compilers apply most of them during compilation. For example, there’s no performance difference between pre-increment and post-increment operators in situations where their semantics are identical. Another example would be a loop like for(int i = 0; i < vec.size(); i++), where one could argue that instead of calling the size() member function on each iteration, it would be better to get the size of the vector before the loop and then compare against that single variable, avoiding a function call per iteration. However, there are cases in which the compiler will detect this silly pattern and cache the result. That is only possible when the function has no side effects and the compiler can be sure that the vector size remains constant during the loop, so it merely applies to fairly trivial cases.
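The Java analogue, as a trivial sketch (whether the JIT hoists the call on its own depends on what it can prove about the list):

import java.util.List;

static long sumLengths(List<String> items) {
    long total = 0;
    // Hoist size() out of the loop; the JIT can often do this itself for
    // an ArrayList, but not if the loop body might modify the list.
    final int n = items.size();
    for (int i = 0; i < n; i++) {
        total += items.get(i).length();
    }
    return total;
}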
could people give examples of what tricks you can use in Java (besides simple compiler flags)?
Other than algorithmic improvements, be sure to consider the memory hierarchy and how the processor makes use of it. There are big benefits in reducing memory-access latencies once you understand how the language in question allocates memory for its data types and objects.
Java example: accessing a 1000×1000 array of ints
Consider the sample code below – it accesses the same area of memory (a 1000×1000 array of ints), but in a different order. On my Mac mini (Core i7, 2.7 GHz) the output is as follows, showing that traversing the array by rows more than doubles the performance (average over 100 rounds each).
Processing columns by rows*** took 4 ms (avg)
Processing rows by columns*** took 10 ms (avg)
This is because the array is stored such that consecutive columns (i.e. the int values) are placed adjacently in memory, whereas consecutive rows are not. For the processor to actually use the data, it has to be transferred into its caches. Memory is transferred in blocks of bytes called cache lines; loading a cache line directly from main memory introduces latency and thus decreases a program’s performance.
For the Core i7 (Sandy Bridge) a cache line holds 64 bytes, so each memory access retrieves 64 bytes. Because the first test accesses memory in a predictable sequence, the processor will prefetch data before it is actually consumed by the program. Overall, this results in less latency on memory accesses and thus improves the performance.
Code of sample:
package test;

public class PerfTest {
    public static void main(String[] args) {
        int[][] numbers = new int[1000][1000];
        long startTime;
        long stopTime;
        long elapsedAvg;
        int tries;
        int maxTries = 100;
        long sink = 0; // consume the reads so the JIT can't eliminate the loops

        // process columns by rows
        System.out.print("Processing columns by rows");
        for (tries = 0, elapsedAvg = 0; tries < maxTries; tries++) {
            startTime = System.currentTimeMillis();
            for (int r = 0; r < 1000; r++) {
                for (int c = 0; c < 1000; c++) {
                    sink += numbers[r][c];
                }
            }
            stopTime = System.currentTimeMillis();
            elapsedAvg += ((stopTime - startTime) - elapsedAvg) / (tries + 1);
        }
        System.out.format("*** took %d ms (avg)%n", elapsedAvg);

        // process rows by columns
        System.out.print("Processing rows by columns");
        for (tries = 0, elapsedAvg = 0; tries < maxTries; tries++) {
            startTime = System.currentTimeMillis();
            for (int c = 0; c < 1000; c++) {
                for (int r = 0; r < 1000; r++) {
                    sink += numbers[r][c];
                }
            }
            stopTime = System.currentTimeMillis();
            elapsedAvg += ((stopTime - startTime) - elapsedAvg) / (tries + 1);
        }
        System.out.format("*** took %d ms (avg)%n", elapsedAvg);

        System.out.println("checksum: " + sink); // keep sink observable
    }
}
The JVM can and often does interfere, and the JIT compiler can change significantly between versions. Some micro-optimisations are impossible in Java due to language limitations, such as being hyper-threading friendly or using the latest Intel processors’ SIMD instructions.
A highly informative blog on the topic from one of the Disruptor authors is recommended reading:
- http://mechanical-sympathy.blogspot.com/
One always has to ask why bother using Java if you want micro-optimisations; there are many alternative methods for accelerating a function, such as using JNA or JNI to hand it off to a native library.