You can use << to multiply and >> to divide numbers in Python. When I time them, I find that the binary shift way of doing it is 10x faster than multiplying or dividing the regular way.

Why is using << and >> so much faster than * and /? What is going on behind the scenes to make * and / so slow?
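A minimal sketch of the kind of timing I mean, using timeit (the exact numbers, and whether you see anything close to 10x, will vary by machine and Python version):

import timeit

# Compare multiplying/dividing by 4 with the equivalent shifts.
print(timeit.timeit("x * 4", setup="x = 37"))
print(timeit.timeit("x << 2", setup="x = 37"))
print(timeit.timeit("x // 4", setup="x = 37"))
print(timeit.timeit("x >> 2", setup="x = 37"))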
Let's look at two small C programs that do a bit shift and a divide.
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int b = i << 2;         /* shift left by 2, i.e. multiply by 4 */
}
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int d = i / 4;          /* divide by 4 */
}
These are then each compiled with gcc -S to see what the actual assembly will be.
With the bit shift version, from the call to atoi to the return:
callq _atoi
movl $0, %ecx
movl %eax, -20(%rbp)
movl -20(%rbp), %eax
shll $2, %eax
movl %eax, -24(%rbp)
movl %ecx, %eax
addq $32, %rsp
popq %rbp
ret
While the divide version:
callq _atoi
movl $0, %ecx
movl $4, %edx
movl %eax, -20(%rbp)
movl -20(%rbp), %eax
movl %edx, -28(%rbp) ## 4-byte Spill
cltd
movl -28(%rbp), %r8d ## 4-byte Reload
idivl %r8d
movl %eax, -24(%rbp)
movl %ecx, %eax
addq $32, %rsp
popq %rbp
ret
Just by looking at this, you can see that the divide version needs several more instructions than the bit shift version. The key question is: what do they do?
In the bit shift version the key instruction is shll $2, %eax, a shift left logical: that's the whole operation, and everything else is just moving values around.
In the divide version, you can see the idivl %r8d – but just above it is a cltd (convert long to double word, which sign-extends %eax into %edx:%eax) plus some additional logic around the spill and reload. This extra work, needed because we're doing signed arithmetic rather than just pushing bits around, is often necessary to avoid the errors that pure bit math can introduce.
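One concrete example of the kind of error being guarded against: C's signed division truncates toward zero, while an arithmetic right shift floors toward negative infinity, so for negative operands a bare shift and a divide disagree. Here is an illustrative sketch in Python, emulating C's truncating division with math.trunc:

import math

i = -7
c_style_division = math.trunc(i / 4)  # C's `/` truncates toward zero: -1
arithmetic_shift = i >> 2             # the shift floors instead: -2
print(c_style_division, arithmetic_shift)

This rounding mismatch is exactly why a compiler can't blindly replace a signed divide with a shift.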
Let's try multiplication next, comparing a right shift (bit.c) with a multiply (mult.c):
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int b = i >> 2;         /* shift right by 2 */
}
#include <stdlib.h>
int main(int argc, char* argv[]) {
    int i = atoi(argv[1]);  /* read the integer from the first argument */
    int d = i * 4;          /* multiply by 4 */
}
Rather than going through all of this, there is one line different:
$ diff mult.s bit.s
24c24
<       shll    $2, %eax
---
>       sarl    $2, %eax
Here the compiler was able to identify that the multiplication could be done with a shift: i * 4 becomes shll $2, %eax, a logical left shift. Our hand-written i >> 2, on the other hand, becomes sarl $2, %eax, an arithmetic shift. The difference matters for negative numbers: sarl preserves the sign, so -8 >> 2 gives -2, matching -8 / 4, whereas a logical right shift would fill with zeros and produce a huge positive number instead.
Let's look at this sign issue in a quick Perl script:
#!/usr/bin/perl
$foo = 4;
print $foo << 2, "\n";
print $foo * 4, "\n";
$foo = -4;
print $foo << 2, "\n";
print $foo * 4, "\n";
Output:
16
16
18446744073709551600
-16
Um… -4 << 2 is 18446744073709551600, which is not exactly what you were probably expecting when dealing with multiplication and division. It's correct – Perl performs bitwise operators on the unsigned 64-bit representation, and 2^64 - 16 is 18446744073709551600 – but it's not integer multiplication.
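For what it's worth, Python (the language the question asks about) avoids this particular trap: its integers are arbitrary-precision and signed, so << behaves like multiplication by a power of two even for negative values:

>>> -4 << 2
-16
>>> -4 * 4
-16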
So be wary of premature optimization. Let the compiler optimize for you – it knows what you're really trying to do, and will likely do a better job of it, with fewer bugs.
The existing answers didn’t really address the hardware side of things, so here’s a bit on that angle. The conventional wisdom is that multiplication and division are much slower than shifting, but the actual story today is more nuanced.
For example, it is certainly true that multiplication is a more complex operation to implement in hardware, but it doesn't always end up slower. As it turns out, add is also significantly more complex to implement than xor (or any bitwise operation in general), but add (and sub) usually get enough transistors dedicated to them that they end up just as fast as the bitwise operators. So you can't use hardware implementation complexity as a guide to speed.
So let's look in detail at shifting versus the “full” operators like multiplication and division.
Shifting
On nearly all hardware, shifting by a constant amount (i.e., an amount the compiler can determine at compile time) is fast. In particular, it will usually execute with a latency of a single cycle and a throughput of 1 per cycle or better. On some hardware (e.g., some Intel and ARM chips), certain shifts by a constant may even be “free”, since they can be built into another instruction (lea on Intel, the built-in shifting of the second source operand on ARM).
Shifting by a variable amount is more of a grey area. On older hardware, this was sometimes very slow, and the speed changed from generation to generation. For example, on the initial release of Intel’s P4, shifting by a variable amount was notoriously slow – requiring time proportional to the shift amount! On that platform, using multiplications to replace shifts could be profitable (i.e., the world has gone upside-down). On prior Intel chips, as well as subsequent generations, shifting by a variable amount wasn’t so painful.
On current Intel chips, shifting by a variable amount is not particularly fast, but it isn't terrible either. The x86 architecture is hamstrung when it comes to variable shifts because the operation was defined in an unusual way: a shift amount of 0 doesn't modify the condition flags, but all other shift amounts do. This inhibits efficient renaming of the flags register, since it can't be determined until the shift executes whether subsequent instructions should read the condition codes written by the shift or by some prior instruction. Furthermore, shifts write only part of the flags register, which may cause a partial flags stall.
The upshot then is that on recent Intel architectures, shift by a variable amount takes three “micro-operations” while most other simple operations (add, bitwise ops, even multiplication) only take 1. Such shifts can execute at most once every 2 cycles.
Multiplication
The trend in modern desktop and laptop hardware is to make multiplication a fast operation. On recent Intel and AMD chips, in fact, one multiplication can be issued every cycle (a reciprocal throughput of 1 cycle). The latency of a multiplication, however, is 3 cycles. So you get the result of any given multiplication 3 cycles after you start it, but you can start a new multiplication every cycle. Which value (1 cycle or 3 cycles) is more important depends on the structure of your algorithm: if the multiplication is part of a critical dependency chain, the latency matters; if not, the reciprocal throughput or other factors may be more important.
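To make the latency/throughput distinction concrete, here is a toy model in Python. It is not a benchmark; it just plugs in the 3-cycle-latency, 1-per-cycle-throughput figures from above and ignores every other pipeline effect:

# Toy model: estimated cycles to complete N multiplications.
N = 100
LATENCY = 3      # cycles until any single result is ready
THROUGHPUT = 1   # multiplications that can be started per cycle

# Dependent chain: each multiply consumes the previous result, so latencies add.
dependent_cycles = N * LATENCY                        # 300 cycles

# Independent multiplies: pipelined, one finishes per cycle after the first.
independent_cycles = LATENCY + (N - 1) // THROUGHPUT  # 102 cycles

print(dependent_cycles, independent_cycles)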
The key takeaway is that on modern laptop chips (or better), multiplication is a fast operation, and likely to be faster than the 3- or 4-instruction sequence that a compiler would emit to get the rounding right for a strength-reduced shift. For variable shifts on Intel, multiplication would also generally be preferred, due to the issues mentioned above.
On smaller form-factor platforms, multiplication may still be slower, since building a full and fast 32-bit or especially 64-bit multiplier takes a lot of transistors and power. If someone can fill in with details of the performance of multiply on recent mobile chips it would be much appreciated.
Divide
Divide is both a more complex operation, hardware-wise, than multiplication and is also much less common in actual code – meaning that fewer resources are likely allocated to it. The trend in modern chips is still towards faster dividers, but even modern top-of-the-line chips take 10-40 cycles to do a divide, and they are only partially pipelined. In general, 64-bit divides are even slower than 32-bit divides. Unlike most other operations, division may take a variable number of cycles depending on the arguments.
Avoid divides and replace them with shifts (or let the compiler do it, but you may need to check the assembly) if you can!
BINARY_LSHIFT and BINARY_RSHIFT are algorithmically simpler operations than BINARY_MULTIPLY and BINARY_FLOOR_DIVIDE, and may take fewer clock cycles. That is, if you have any binary number and need to shift by N, all you have to do is move the digits over that many places and fill in with zeros. Binary multiplication is in general more complicated, though techniques like the Dadda multiplier make it quite fast.
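The shift really is just scaling by a power of two. A quick sanity check in Python, where integers are arbitrary-precision so the identities hold exactly, even for negative values:

# For any int x and shift amount n >= 0:
#   x << n == x * 2**n     (left shift = multiply by a power of two)
#   x >> n == x // 2**n    (right shift = floor-divide by a power of two)
for x in (37, -37, 12345678901234567890):
    for n in (1, 2, 7):
        assert x << n == x * 2**n
        assert x >> n == x // 2**n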
Granted, it is possible for an optimizing compiler to recognize when you multiply or divide by a power of two and replace it with the appropriate left/right shift. Judging by the disassembled bytecode, Python apparently doesn't do this:
>>> import dis
>>> dis.dis(lambda x: x*4)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (4)
6 BINARY_MULTIPLY
7 RETURN_VALUE
>>> dis.dis(lambda x: x<<2)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (2)
6 BINARY_LSHIFT
7 RETURN_VALUE
>>> dis.dis(lambda x: x//2)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (2)
6 BINARY_FLOOR_DIVIDE
7 RETURN_VALUE
>>> dis.dis(lambda x: x>>1)
1 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (1)
6 BINARY_RSHIFT
7 RETURN_VALUE
However, on my processor, I find that multiplication and left/right shifts have similar timings, and that floor division (by a power of two) is about 25% slower:
>>> import timeit
>>> timeit.repeat("z=a + 4", setup="a = 37")
[0.03717184066772461, 0.03291916847229004, 0.03287005424499512]
>>> timeit.repeat("z=a - 4", setup="a = 37")
[0.03534698486328125, 0.03207516670227051, 0.03196907043457031]
>>> timeit.repeat("z=a * 4", setup="a = 37")
[0.04594111442565918, 0.0408930778503418, 0.045324087142944336]
>>> timeit.repeat("z=a // 4", setup="a = 37")
[0.05412912368774414, 0.05091404914855957, 0.04910898208618164]
>>> timeit.repeat("z=a << 2", setup="a = 37")
[0.04751706123352051, 0.04259490966796875, 0.041903018951416016]
>>> timeit.repeat("z=a >> 2", setup="a = 37")
[0.04719185829162598, 0.04201006889343262, 0.042105913162231445]