When compiling C code with gcc, there are compiler optimizations: some limit code size and others produce fast code.
Using the -S flag, I see that -O2/-O3 generates more assembly than -Os does. How is more assembly still faster than less assembly?
On a modern processor, there are usually several ways to achieve the result specified in a higher-level language (such as C). These solutions can have different trade-offs between code size and speed, due to several factors:
- Not all assembly instructions take the same amount of time to execute. For example, a particular result might be achieved either with 2 instructions that take 10 clock cycles each, or with 6 instructions that take 3 clock cycles each. The six short instructions total 18 cycles against 20, so the larger code is still faster; the two long instructions may redo work that the six short instructions avoid.
- On a modern processor, it makes a huge difference in execution speed if the next instruction is already present in the cache or if it has to come from main memory. This effect is most visible with branching instructions, because they make it harder to tell what the next instruction will be. Often, compilers will try to offset these effects by unrolling (part of) a loop into a repeated block of instructions to reduce the costs of branching/jumping.
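As a rough illustration of the unrolling trade-off (this is a source-level sketch, not the exact code gcc emits), the unrolled version below contains more instructions but takes the loop branch far less often. Compiling a file like this at -O2 and -Os and comparing the generated .s files shows the size difference the question describes:

```c
/* sum.c -- a sketch of what loop unrolling looks like, written out by hand.
 * The unrolled version approximates what an optimizer might produce;
 * the exact output of gcc -O2 will differ.
 *
 * Compare the generated assembly yourself with, e.g.:
 *   gcc -S -O2 sum.c -o sum-O2.s
 *   gcc -S -Os sum.c -o sum-Os.s
 */
#include <stddef.h>

/* Straightforward version: one compare and one jump per element. */
long sum_simple(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: more instructions in the binary, but only one
 * compare/jump per four elements, which is roughly what -O2/-O3 aim for. */
long sum_unrolled(const long *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* handle the leftover elements */
        s += a[i];
    return s;
}
```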
Well, most of the time the compiler generates more instructions so that fewer of them are executed in a given run, usually by generating specialized code for different cases:
- Loop unrolling. The loop's jump is only taken once every n iterations (often 4 or 8) instead of on every iteration.
- Function inlining. It saves the call, the return, argument copying, and stack manipulation, as sketched below.
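To make the inlining point concrete, here is a hedged sketch (the function names are made up, and gcc makes this decision by itself at -O2): inlining removes the call overhead and lets the body be optimized together with the call site, at the cost of duplicating the body wherever it is used.

```c
/* Two hand-written versions of the same computation, to show what the
 * compiler effectively does when it inlines a small function. */

static int square(int x)
{
    return x * x;
}

/* Before inlining: each use pays for argument setup, a call and a return. */
int sum_of_squares_call(int a, int b)
{
    return square(a) + square(b);
}

/* After inlining: no call overhead, and the multiplications can be
 * optimized together with the surrounding code.  The body is duplicated
 * at every call site, which is part of why the binary gets bigger. */
int sum_of_squares_inlined(int a, int b)
{
    return a * a + b * b;
}
```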
Another factor is that some instructions take more time than others. In particular, conditional branches that are difficult to predict can be significantly slower. However, both calls and loop branches follow regular patterns that the branch predictor handles well.
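A rough way to see the predictor at work is to time the same loop with a predictable and an unpredictable condition (a sketch with made-up sizes; exact timings depend on the CPU, and at higher optimization levels gcc may turn the branch into a conditional move, which hides the effect):

```c
/* branch.c -- same instruction count, very different branch behavior. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

static long run(const int *data)
{
    long sum = 0;
    for (int i = 0; i < N; i++) {
        if (data[i] > 127)      /* this branch is what the predictor sees */
            sum += data[i];
    }
    return sum;
}

int main(void)
{
    int *random_data = malloc(N * sizeof *random_data);
    int *constant_data = malloc(N * sizeof *constant_data);
    if (!random_data || !constant_data)
        return 1;

    for (int i = 0; i < N; i++) {
        random_data[i] = rand() % 256;   /* taken/not-taken is unpredictable */
        constant_data[i] = 200;          /* branch is always taken */
    }

    clock_t t0 = clock();
    long a = run(random_data);
    clock_t t1 = clock();
    long b = run(constant_data);
    clock_t t2 = clock();

    printf("unpredictable: %ld (%.3fs)\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("predictable:   %ld (%.3fs)\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(random_data);
    free(constant_data);
    return 0;
}
```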
Finally there is the instruction cache, where things are not so clear cut. The cache works better when code is read linearly (which inlining helps with), but it also has a limited size, so a larger portion of a small program fits in it.