You can use << to multiply and >> to divide numbers in Python. When I time them, I find that the binary shift way of doing it is 10x faster than multiplying or dividing the regular way.

Why is using << and >> so much faster than * and /? What is going on behind the scenes to make * and / so slow?
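A minimal sketch of the kind of timing I mean, using timeit (the exact numbers, and whether you see anything close to 10x, will vary by machine and Python version):

import timeit

# Compare multiplying/dividing by 4 with the equivalent shifts.
print(timeit.timeit("x * 4", setup="x = 37"))
print(timeit.timeit("x << 2", setup="x = 37"))
print(timeit.timeit("x // 4", setup="x = 37"))
print(timeit.timeit("x >> 2", setup="x = 37"))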
Let's look at two small C programs that do a bit shift and a divide.
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int b = i << 2;         /* shift left by 2, i.e. multiply by 4 */
}
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int d = i / 4;          /* divide by 4 */
}
These are then each compiled with gcc -S to see what the actual assembly will be.
With the bit shift version, from the call to atoi to the return:
callq _atoi
movl $0, %ecx
movl %eax, -20(%rbp)
movl -20(%rbp), %eax
shll $2, %eax
movl %eax, -24(%rbp)
movl %ecx, %eax
addq $32, %rsp
popq %rbp
ret
While the divide version:
callq _atoi
movl $0, %ecx
movl $4, %edx
movl %eax, -20(%rbp)
movl -20(%rbp), %eax
movl %edx, -28(%rbp) ## 4-byte Spill
cltd
movl -28(%rbp), %r8d ## 4-byte Reload
idivl %r8d
movl %eax, -24(%rbp)
movl %ecx, %eax
addq $32, %rsp
popq %rbp
ret
Just by looking at this, you can see that the divide version needs several more instructions than the bit shift version. The key question is: what do they do?
In the bit shift version the key instruction is shll $2, %eax, a shift left logical: that's the whole operation, and everything else is just moving values around.
In the divide version, you can see the idivl %r8d – but just above it is a cltd (convert long to double word, which sign-extends %eax into %edx:%eax) plus some additional logic around the spill and reload. This extra work, needed because we're doing signed arithmetic rather than just pushing bits around, is often necessary to avoid the errors that pure bit math can introduce.
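One concrete example of the kind of error being guarded against: C's signed division truncates toward zero, while an arithmetic right shift floors toward negative infinity, so for negative operands a bare shift and a divide disagree. Here is an illustrative sketch in Python, emulating C's truncating division with math.trunc:

import math

i = -7
c_style_division = math.trunc(i / 4)  # C's `/` truncates toward zero: -1
arithmetic_shift = i >> 2             # the shift floors instead: -2
print(c_style_division, arithmetic_shift)

This rounding mismatch is exactly why a compiler can't blindly replace a signed divide with a shift.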
Let's try multiplication next, comparing a right shift (bit.c) with a multiply (mult.c):
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int b = i >> 2;         /* shift right by 2 */
}
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int d = i * 4;          /* multiply by 4 */
}
Rather than going through all of this, there is one line different:
$ diff mult.s bit.s
24c24
<       shll    $2, %eax
---
>       sarl    $2, %eax
Here the compiler was able to identify that the multiplication could be done with a shift: i * 4 becomes shll $2, %eax, a logical left shift. Our hand-written i >> 2, on the other hand, becomes sarl $2, %eax, an arithmetic shift. The difference matters for negative numbers: sarl preserves the sign, so -8 >> 2 gives -2, matching -8 / 4, whereas a logical right shift would fill with zeros and produce a huge positive number instead.
Let's look at this sign issue in a quick Perl script:
#!/usr/bin/perl
$foo = 4;
print $foo << 2, "\n";
print $foo * 4, "\n";
$foo = -4;
print $foo << 2, "\n";
print $foo * 4, "\n";
Output:
16
16
18446744073709551600
-16
Um… -4 << 2 is 18446744073709551600, which is not exactly what you were probably expecting when dealing with multiplication and division. It's correct – Perl performs bitwise operators on the unsigned 64-bit representation, and 2^64 - 16 is 18446744073709551600 – but it's not integer multiplication.
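For what it's worth, Python (the language the question asks about) avoids this particular trap: its integers are arbitrary-precision and signed, so << behaves like multiplication by a power of two even for negative values:

>>> -4 << 2
-16
>>> -4 * 4
-16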
So be wary of premature optimization. Let the compiler optimize for you – it knows what you're really trying to do, and will likely do a better job of it, with fewer bugs.
The existing answers didn’t really address the hardware side of things, so here’s a bit on that angle. The conventional wisdom is that multiplication and division are much slower than shifting, but the actual story today is more nuanced.
For example, it is certainly true that multiplication is a more complex operation to implement in hardware, but it doesn't always end up slower. As it turns out, add is also significantly more complex to implement than xor (or any bitwise operation in general), but add (and sub) usually get enough transistors dedicated to them that they end up just as fast as the bitwise operators. So you can't use hardware implementation complexity as a guide to speed.
So let's look in detail at shifting versus the “full” operators like multiplication and division.
Shifting
On nearly all hardware, shifting by a constant amount (i.e., an amount the compiler can determine at compile time) is fast. In particular, it will usually execute with a latency of a single cycle and a throughput of 1 per cycle or better. On some hardware (e.g., some Intel and ARM chips), certain shifts by a constant may even be “free”, since they can be built into another instruction (lea on Intel, the built-in shifting of the second source operand on ARM).
Shifting by a variable amount is more of a grey area. On older hardware, this was sometimes very slow, and the speed changed from generation to generation. For example, on the initial release of Intel’s P4, shifting by a variable amount was notoriously slow – requiring time proportional to the shift amount! On that platform, using multiplications to replace shifts could be profitable (i.e., the world has gone upside-down). On prior Intel chips, as well as subsequent generations, shifting by a variable amount wasn’t so painful.
On current Intel chips, shifting by a variable amount is not particularly fast, but it isn't terrible either. The x86 architecture is hamstrung when it comes to variable shifts because the operation was defined in an unusual way: a shift amount of 0 doesn't modify the condition flags, but all other shift amounts do. This inhibits efficient renaming of the flags register, since it can't be determined until the shift executes whether subsequent instructions should read the condition codes written by the shift or by some prior instruction. Furthermore, shifts write only part of the flags register, which may cause a partial flags stall.
The upshot then is that on recent Intel architectures, shift by a variable amount takes three “micro-operations” while most other simple operations (add, bitwise ops, even multiplication) only take 1. Such shifts can execute at most once every 2 cycles.
Multiplication
The trend in modern desktop and laptop hardware is to make multiplication a fast operation. On recent Intel and AMD chips, in fact, one multiplication can be issued every cycle (a reciprocal throughput of 1 cycle). The latency of a multiplication, however, is 3 cycles. So you get the result of any given multiplication 3 cycles after you start it, but you can start a new multiplication every cycle. Which value (1 cycle or 3 cycles) is more important depends on the structure of your algorithm: if the multiplication is part of a critical dependency chain, the latency matters; if not, the reciprocal throughput or other factors may be more important.
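To make the latency/throughput distinction concrete, here is a toy model in Python. It is not a benchmark; it just plugs in the 3-cycle-latency, 1-per-cycle-throughput figures from above and ignores every other pipeline effect:

# Toy model: estimated cycles to complete N multiplications.
N = 100
LATENCY = 3      # cycles until any single result is ready
THROUGHPUT = 1   # multiplications that can be started per cycle

# Dependent chain: each multiply consumes the previous result, so latencies add.
dependent_cycles = N * LATENCY                        # 300 cycles

# Independent multiplies: pipelined, one finishes per cycle after the first.
independent_cycles = LATENCY + (N - 1) // THROUGHPUT  # 102 cycles

print(dependent_cycles, independent_cycles)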
The key takeaway is that on modern laptop chips (or better), multiplication is a fast operation, and likely to be faster than the 3- or 4-instruction sequence that a compiler would emit to get the rounding right for a strength-reduced shift. For variable shifts on Intel, multiplication would also generally be preferred, due to the issues mentioned above.
On smaller form-factor platforms, multiplication may still be slower, since building a full and fast 32-bit or especially 64-bit multiplier takes a lot of transistors and power. If someone can fill in with details of the performance of multiply on recent mobile chips it would be much appreciated.
Divide
Divide is both a more complex operation, hardware-wise, than multiplication and is also much less common in actual code – meaning that fewer resources are likely allocated to it. The trend in modern chips is still towards faster dividers, but even modern top-of-the-line chips take 10-40 cycles to do a divide, and they are only partially pipelined. In general, 64-bit divides are even slower than 32-bit divides. Unlike most other operations, division may take a variable number of cycles depending on the arguments.
Avoid divides and replace them with shifts (or let the compiler do it, but you may need to check the assembly) if you can!
BINARY_LSHIFT and BINARY_RSHIFT are algorithmically simpler operations than BINARY_MULTIPLY and BINARY_FLOOR_DIVIDE, and may take fewer clock cycles. That is, if you have any binary number and need to shift by N, all you have to do is move the digits over that many places and fill in with zeros. Binary multiplication is in general more complicated, though techniques like the Dadda multiplier make it quite fast.
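The shift really is just scaling by a power of two. A quick sanity check in Python, where integers are arbitrary-precision so the identities hold exactly, even for negative values:

# For any int x and shift amount n >= 0:
#   x << n == x * 2**n     (left shift = multiply by a power of two)
#   x >> n == x // 2**n    (right shift = floor-divide by a power of two)
for x in (37, -37, 12345678901234567890):
    for n in (1, 2, 7):
        assert x << n == x * 2**n
        assert x >> n == x // 2**n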
Granted, it is possible for an optimizing compiler to recognize when you multiply or divide by a power of two and replace it with the appropriate left/right shift. Judging by the disassembled bytecode, Python apparently doesn't do this:
>>> import dis
>>> dis.dis(lambda x: x*4)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (4)
6 BINARY_MULTIPLY
7 RETURN_VALUE
>>> dis.dis(lambda x: x<<2)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (2)
6 BINARY_LSHIFT
7 RETURN_VALUE
>>> dis.dis(lambda x: x//2)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (2)
6 BINARY_FLOOR_DIVIDE
7 RETURN_VALUE
>>> dis.dis(lambda x: x>>1)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (1)
6 BINARY_RSHIFT
7 RETURN_VALUE
However, on my processor, I find that multiplication and left/right shifts have similar timings, and that floor division (by a power of two) is about 25% slower:
>>> import timeit
>>> timeit.repeat("z=a + 4", setup="a = 37")
[0.03717184066772461, 0.03291916847229004, 0.03287005424499512]
>>> timeit.repeat("z=a - 4", setup="a = 37")
[0.03534698486328125, 0.03207516670227051, 0.03196907043457031]
>>> timeit.repeat("z=a * 4", setup="a = 37")
[0.04594111442565918, 0.0408930778503418, 0.045324087142944336]
>>> timeit.repeat("z=a // 4", setup="a = 37")
[0.05412912368774414, 0.05091404914855957, 0.04910898208618164]
>>> timeit.repeat("z=a << 2", setup="a = 37")
[0.04751706123352051, 0.04259490966796875, 0.041903018951416016]
>>> timeit.repeat("z=a >> 2", setup="a = 37")
[0.04719185829162598, 0.04201006889343262, 0.042105913162231445]