When I test out the difference in time between shifting and multiplying in C, there is no difference. Why?

I have been taught that shifting in binary is much more efficient than multiplying by 2^k. So I wanted to experiment, and I used the following code to test this out:

#include <time.h>
#include <stdio.h>

int main() {
    clock_t launch = clock();
    int test = 0x01;
    int runs;

    //simple loop that oscillates between int 1 and int 2
    for (runs = 0; runs < 100000000; runs++) {


    // I first compiled + ran it a few times with this:
    test *= 2;

    // then I recompiled + ran it a few times with:
    test <<= 1;

    // set back to 1 each time
    test >>= 1;
    }

    clock_t done = clock();
    double diff = (done - launch);
    printf("%f\n", diff);
}

For both versions, the print out was approximately 440000, give or take 10000. There was no (visually, at least) significant difference between the two versions’ outputs. So my question is, is there something wrong with my methodology? Should there even be a visual difference? Does this have something to do with the architecture of my computer, the compiler, or something else?

13

As said in the other answer, most compilers will automatically optimize multiplications to be done with bitshifts.

This is a very general rule when optimizing:
Most ‘optimizations’ will actually mislead the compiler about what you really mean, and may even hurt performance.

Only optimize when you have noticed a performance problem and measured what the problem is. (and most code we write doesn’t get executed that often, so we don’t need to bother)

The big downside to optimizing is that the ‘optimized’ code is often much less readable. So in your case, always go for multiplication when you are looking to multiply. And go for bit shifting when you want to move bits.

11

The compiler recognizes constants and converts multiplies to shifts where appropriate.

4

Whether shifting is faster than multiplication depends on the architecture of your CPU. Back in the days of the Pentium and earlier, shifting was often faster than multiplication, depending on the number of 1 bits in your multiplicand. For example, if your multiplicand was 320, that’s 101000000 in binary: two set bits.

a *= 320;               // Slower
a = (a<<6) + (a<<8);    // Faster

But if you had more than two bits …

a *= 324;                        // About same speed
a = (a<<2) + (a<<6) + (a<<8);    // About same speed

a *= 340;                                 // Faster
a = (a<<2) + (a<<4) + (a<<6) + (a<<8);    // Slower

On a little microcontroller like a PIC18 with single cycle multiply, but no barrel shifter, multiplication is faster if you’re shifting by more than 1 bit.

a  *= 2;   // Exactly the same speed
a <<= 1;   // Exactly the same speed

a  *= 4;   // Faster
a <<= 2;   // Slower

Note that that’s the opposite of what was true on older Intel CPUs.

But it’s still not that simple. If I remember correctly, due to its superscalar architecture, a Pentium was able to process either one multiply instruction or two shift instructions simultaneously (as long as they weren’t dependent on each other). This means that if you wanted to multiply two variables by a power of 2, then shifting might be better.

a  *= 4;   // Two multiplies cannot pair:
b  *= 4;   // they execute one at a time

a <<= 2;   // Both lines execute in a single cycle
b <<= 2;   //

1

You’ve got several problems with your test program.

First, you’re not actually using the value of test. There is no way, within the C standard, that the value of test matters. The optimizer is thus completely free to remove it. Once it’s removed, your loop is actually empty. The only visible effect would be to set runs = 100000000, but runs also isn’t used. So the optimizer can (and should!) remove the entire loop. Easy fix: also print the computed value. Note that a sufficiently determined optimizer could still optimize away the loop (it relies entirely on constants known at compile time).

Second, you do two operations that cancel each other out. The optimizer is allowed to notice this and cancel them out, again leaving an empty loop to be removed. This one is downright hard to fix. You can switch to an unsigned int (so overflow isn’t undefined behavior), but that of course just results in 0. And simple things (like, say, test += 1) are easy enough for the optimizer to figure out, and it does.

Finally, you assume that test *= 2 is actually going to be compiled to a multiply. That’s a very simple optimization; if the bitshift is faster, the optimizer will use it instead. To get around this, you’d have to use something like implementation-specific inline assembly.

Or, I suppose, just check your microprocessor data sheet to see which is faster.

When I checked the assembly output of compiling your program with gcc -S -O3 using version 4.9, the optimizer actually saw through every simple variation above, and several more. In all cases it removed the loop (assigning a constant to test); the only things left were the calls to clock(), the convert/subtract, and the printf.

4

I think it would be more helpful for the questioner to have a more differentiated answer, because I see several unexamined assumptions in the question and in some of the answers or comments.

The resulting relative runtime of shifting and multiplication has nothing to do with C. When I say C, I do not mean the instance of a specific implementation, such as this or that version of GCC, but the language. I do not mean to take this ad absurdum, but to use an extreme example for illustration: you could implement a completely standards-compliant C compiler and have multiplication take an hour, while shifting takes milliseconds – or the other way around. I am not aware of any such performance restrictions in C or C++.

You may not care about this technicality in argumentation. Your intention was probably to just test out the relative performance of doing shifts versus multiplications and you chose C, because it is generally perceived as a low level programming language, so one may expect its source code to translate into corresponding instructions more directly. Such questions are very common and I think a good answer should point out that even in C your source code does not translate into instructions as directly as you may think in a given instance. I have given you some possible compilation results below.

This is where comments that question the usefulness of substituting this equivalence in real-world software come in. You can see some in the comments to your question, such as the one from Eric Lippert. It is in line with the reaction you will generally get from more seasoned engineers in response to such optimizations. If you use binary shifts in production code as a blanket means of multiplying and dividing, people will most likely cringe at your code and have some degree of emotional reaction (“I have heard this nonsensical claim made about JavaScript for heaven’s sake.”) to it that may not make sense to novice programmers, unless they better understand the reasons for those reactions.

Those reasons are primarily a combination of the decreased readability and the futility of such optimization, as you may have found out already when comparing their relative performance. However, I do not think that people would have as strong of a reaction if substitution of shift for multiplication was the only example of such optimizations. Questions like yours frequently come up in various forms and in various contexts. I think what more senior engineers actually react to so strongly, at least I have at times, is that there is potential for a much wider range of harm when people employ such micro-optimizations liberally across the code base. If you work at a company like Microsoft on a large code base, you will spend a lot of time reading other engineers’ source code, or attempting to locate certain code therein. It may even be your own code that you will be trying to make sense of in a few years’ time, particularly at some of the most inopportune times, such as when you have to fix a production outage after being paged on a Friday night, about to head out for a night of fun with friends … If you spend that much time reading code, you will appreciate it being as readable as possible. Imagine reading your favorite novel, but the publisher has decided to release a new edition where they use abbrv. all ovr th plc bcs thy thnk it svs spc. That is akin to the reactions other engineers may have to your code if you sprinkle it with such optimizations. As other answers have pointed out, it is better to clearly state what you mean, that is, to use the operation in code that is semantically closest to the idea you are trying to express.

Even in those environments, though, you may find yourself solving an interview question where you are expected to know this or some other equivalence. Knowing them is not bad and a good engineer would be aware of the arithmetic effect of binary shifting. Note that I did not say that this makes a good engineer, but that a good engineer would know, in my opinion. In particular, you may still find some manager, usually towards the end of your interview loop, who will grin broadly at you in anticipation of the delight to reveal this smart engineering “trick” to you in a coding question and prove that he/she, too, used to be or is one of the savvy engineers and not “just” a manager. In those situations, just try to look impressed and thank him/her for the enlightening interview.

Why did you not see a speed difference in C? The most likely answer is that both resulted in the same assembly code:

int shift(int i) { return i << 2; }
int multiply(int i) { return i * 4; }

Can both compile into

shift(int):
    lea eax, [0+rdi*4]
    ret

On GCC without optimizations, i.e. using the flag “-O0”, you may get this:

shift(int):
    push    rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    sal eax, 2
    pop rbp
    ret
multiply(int):
    push    rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    sal eax, 2
    pop rbp
    ret

As you can see, passing “-O0” to GCC does not mean that it will not be somewhat smart about what kind of code it produces. In particular, notice that even in this case the compiler avoided the use of a multiply instruction. You can repeat the same experiment with shifts by other numbers and even multiplications by numbers that are not powers of two. Chances are that on your platform you will see a combination of shifts and additions, but no multiplications. It seems like a bit of a coincidence for the compiler to apparently avoid using multiplications in all those cases if multiplications and shifts really had the same cost, does it not? But I do not mean to supply supposition for proof, so let us move on.

You could rerun your test with the above code and see if you notice a speed difference now. Even then you are not testing shift versus multiply (as you can see by the absence of a multiplication) but rather the code that GCC generated, with a certain set of flags, for the C operations of shift and multiply in this particular instance. So, in another test, you could edit the assembly code by hand and use an “imul” instruction in the code for the “multiply” method instead.

If you wanted to defeat some of those smarts of the compiler, you could define more general shift and multiply methods, and you will end up with something like this:

int shift(int i, int j) { return i << j; }
int multiply(int i, int j) { return i * j; }

Which may yield the following assembly code:

shift(int, int):
    mov eax, edi
    mov ecx, esi
    sal eax, cl
    ret
multiply(int, int):
    mov eax, edi
    imul    eax, esi
    ret

Here we finally have, even on the highest optimization level of GCC 4.9, the expression in assembly instructions you may have expected when you initially set out on your test. I think that in itself can be an important lesson in performance optimization. We can see the difference it made to substitute variables for concrete constants in our code, in terms of the smarts that the compiler is able to apply. Micro-optimizations like the shift-multiply substitution are very low-level optimizations that a compiler can usually easily do by itself. Other optimizations that are much more impactful on performance require an understanding of the intention of the code, which is often not accessible to the compiler or can only be guessed at by some heuristic. That is where you as a software engineer come in, and it typically does not involve substituting multiplications with shifts. It involves factors such as avoiding a redundant call to a service that produces I/O and can block a process. If you go to your hard disk or, god forbid, to a remote database for some extra data you could have derived from what you already have in memory, the time you spend waiting outweighs the execution of a million instructions. Now, I think we have strayed a bit far from your original question, but I think pointing this out to a questioner, especially if we suppose someone who is just starting to get a grasp on the translation and execution of code, can be extremely important to address not just the question itself but many of its (possibly) underlying assumptions.

So, which one will be faster? I think it is a good approach you chose to actually test out the performance difference. In general, it is easy to be surprised by the runtime performance of some code changes. There are many techniques modern processors employ, and the interactions between hardware and software can be complex, too. Even if you should get beneficial performance results for a certain change in one situation, I think it is dangerous to conclude that this type of change will always yield performance benefits. I think it is dangerous to run such tests once, say “Okay, now I know which one’s faster!” and then indiscriminately apply that same optimization to production code without repeating your measurements.

So what if the shift is faster than the multiplication? There are certainly indications why that would be true. GCC, as you can see above, appears to think (even without optimization) that avoiding direct multiplication in favor of other instructions is a good idea. The Intel 64 and IA-32 Architectures Optimization Reference Manual will give you an idea of the relative cost of CPU instructions. Another resource, more focused on instruction latency and throughput, is http://www.agner.org/optimize/instruction_tables.pdf. Note that they are not a good predictor of absolute runtime, but of the performance of instructions relative to each other. In a tight loop, as your test is simulating, the metric of “throughput” should be most relevant. It is the number of cycles that an execution unit will typically be tied up for when executing a given instruction.

So what if the shift is NOT faster than the multiplication? As I said above, modern architectures can be quite complex and things such as branch prediction, caching, pipelining, and parallel execution units can make it hard to predict the relative performance of two logically equivalent pieces of code at times. I really want to emphasize this, because this is where I am not happy with most answers to questions like these and with the camp of people outright saying that it is simply not true (anymore) that shifting is faster than multiplication.

No, as far as I am aware we did not invent some secret engineering sauce in the 1970’s or whenever to suddenly annul the cost difference of a multiplication unit and a bit shifter. A general multiplication, in terms of logical gates, and certainly in terms of logical operations, is still more complex than a shift with a barrel shifter in many scenarios, on many architectures. How this translates into overall runtime on a desktop computer may be a bit opaque. I do not know for sure how they are implemented in specific processors, but here is an explanation of a multiplication: Is integer multiplication really same speed as addition on modern CPU

Here is an explanation of a barrel shifter. The documents I have referenced in the previous paragraph give another view on the relative cost of operations, by proxy of CPU instructions. The engineers over at Intel frequently seem to get similar questions: intel developer zone forums clock cycles for integer multiplication and addition in core 2 duo processor

Yes, in most real-life scenarios, and almost certainly in JavaScript, attempting to exploit this equivalence for performance’s sake is probably a futile undertaking. However, even if we forced the use of multiplication instructions and then saw no difference in run-time, that is more due to the nature of the cost metric we used, to be precise, and not because there is no cost difference. End-to-end runtime is one metric and if it is the only one we care about, all is well. But that does not mean that all cost differences between multiplication and shifting have simply disappeared. And I think it is certainly not a good idea to convey that idea to a questioner, by implication or otherwise, who is obviously just starting to get an idea of the factors involved in the run-time and cost of modern code. Engineering is always about trade-offs. Inquiry into and explanation as to what tradeoffs modern processors have made to exhibit the execution time that we as users end up seeing may yield a more differentiated answer. And I think a more differentiated answer than “this is simply not true anymore” is warranted if we want to see fewer engineers check in micro-optimized code obliterating readability, because it takes a more general understanding of the nature of such “optimizations” to spot its various, diverse incarnations than simply referring to some specific instances as out of date.

What you see is the effect of the optimiser.

The optimiser’s job is to make the resulting compiled code either smaller or faster (but rarely both at the same time… like many things, IT DEPENDS on what the code is).

In PRINCIPLE, any call to a multiplication library, or, frequently, even use of a hardware multiplier will be slower than just doing a bitwise shift.

So… if the naive compiler generated a call to a library for the operation *2, then of course it would run slower than a bitwise shift.

However optimisers are there to detect patterns and figure out how to make the code smaller/faster/ whatever. And what you have seen is the compiler detecting that *2 is the same as a shift.

Just as a matter of interest: I was looking at the generated assembler today for some operations like *5 (not actually looking at that, but at other things), and along the way I noticed that the compiler had turned *5 into:

  • shift
  • shift
  • add original number

So the optimiser of my compiler was smart enough (at least for certain small constants) to generate inline shifts and adds instead of calls to a general purpose multiply library.

The art of compiler optimisers is a whole separate subject, filled with magic, and really properly understood by about 6 people on the whole planet 🙂

Try timing it with:

for (runs = 0; runs < 100000000; runs++) {
      ;
}

The compiler should recognize that the value of test is unchanged after each iteration of the loop, and that the final value of test is unused, and eliminate the loop entirely.

Multiplication is a combination of shifts and additions.

In the case you’ve mentioned, I don’t believe it matters whether the compiler optimises it or not – “multiply x by two” can be implemented as either:

  • Shift bits of x one place to the left.
  • Add x to x.

These are each basic atomic operations; one isn’t faster than the other.

Change it to “multiply x by four” (or any 2^k, k>1) and it’s a little different:

  • Shift bits of x two places to the left.
  • Add x to x and call it y, add y to y.

On a basic architecture, it’s simple to see that the shift is more efficient – taking one vs. two operations, since we cannot add y to y until we know what y is.

Try the latter (or any 2^k, k>1), with appropriate options to prevent your compiler from optimising them to be the same thing in implementation. You should find the shift is faster, taking O(1) compared to the repeated addition in O(k).

Obviously, where the multiplicand is not a power of two, a combination of shifts and additions (one where the number of each is non-zero) is necessary.

6

Multiplication of signed or unsigned values by powers of two is equivalent to left-shifting, and most compilers will make the substitution. Division of unsigned values, or signed values which the compiler can prove are never negative, is equivalent to right-shifting, and most compilers will make that substitution (though some aren’t sophisticated enough to prove when signed values can’t be negative).

It should be noted, however, that division of potentially-negative signed values is not equivalent to right-shifting. An expression like (x+8)>>4 is not equivalent to (x+8)/16. The former, in 99% of compilers, will map values from -24 to -9 to -1, -8 to +7 to 0, and +8 to +23 to 1 [rounding numbers almost symmetrically about zero]. The latter will map -39 to -24 to -1, -23 to +7 to 0, and +8 to +23 to +1 [grossly asymmetrical, and likely not what was intended]. Note that even when values are not expected to be negative, use of >>4 will likely yield faster code than /16 unless the compiler can prove the values can’t be negative.

Some more info I just checked out.

On x86_64, the MUL opcode has a 10 cycle latency and 1/2 cycle throughput. MOV, ADD and SHL have a latency of 1 cycle, with 2.5, 2.5, and 1.7 cycle throughput.

A multiplication by 15 would require 3 SHL and 3 ADD ops at minimum and probably a couple of MOVs.

https://gmplib.org/~tege/x86-timing.pdf

Your methodology is flawed. Your loop increment and condition checking itself is taking that much time.

  • Try running an empty loop and measure the time (call it base).
  • Now add 1 shift operation and measure the time (call it s1).
  • Next add 10 shift operations and measure the time (call it s2)

If everything is going correctly, (s2 - base) should be about 10 times (s1 - base). Otherwise something else is coming into play here.

Now I actually tried this myself and figured: if loops are causing a problem, why not remove them entirely? So I went ahead and did this:

int main(){

    int test = 2;
    clock_t launch = clock();

    test << 6;
    test << 6;
    test << 6;
    test << 6;
    //.... 1 million times
    test << 6;

    clock_t done = clock();
    printf("Time taken : %ld\n", (long)(done - launch));
    return 0;
}

And there you have your result

1 million shift operations in under 1 millisecond?!

I did the same thing for multiplication by 64 and got the same result. So the compiler is probably ignoring the operations completely; as others mentioned, the value of test is never changed and the results of the shifts are never used.

(Screenshot: timing output of the shift test)

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị
Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa

When I test out the difference in time between shifting and multiplying in C, there is no difference. Why?

I have been taught that shifting in binary is much more efficient than multiplying by 2^k. So I wanted to experiment, and I used the following code to test this out:

#include <time.h>
#include <stdio.h>

int main() {
    clock_t launch = clock();
    int test = 0x01;
    int runs;

    //simple loop that oscillates between int 1 and int 2
    for (runs = 0; runs < 100000000; runs++) {


    // I first compiled + ran it a few times with this:
    test *= 2;

    // then I recompiled + ran it a few times with:
    test <<= 1;

    // set back to 1 each time
    test >>= 1;
    }

    clock_t done = clock();
    double diff = (done - launch);
    printf("%fn",diff);
}

For both versions, the print out was approximately 440000, give or take 10000. There was no (visually, at least) significant difference between the two versions’ outputs. So my question is, is there something wrong with my methodology? Should there even be a visual difference? Does this have something to do with the architecture of my computer, the compiler, or something else?

13

As said in the other answer, most compilers will automatically optimize multiplications to be done with bitshifts.

This is a very general rule when optimizing:
Most ‘optimizations’ will actually misguide the compile about what you really mean, and might even lessen the performance.

Only optimize when you have noticed a performance problem and measured what the problem is. (and most code we write doesn’t get executed that often, so we don’t need to bother)

The big downside to optimizing is that the ‘optimized’ code is often much less readable. So in your case, always go for multiplication when you are looking to multiply. And go for bit shifting when you want to move bits.

11

The compiler recognizes constants and converts multiplies to shifts where appropriate.

4

Whether shifting is faster than multiplication depends on the architecture of your CPU. Back in the days of the Pentium and earlier, shifting was often faster than multiplication, depending on the number of 1 bits in your multiplicand. For example, if your multiplicand was 320, that’s 101000000, two bits.

a *= 320;               // Slower
a = (a<<7) + (a<<9);    // Faster

But if you had more than two bits …

a *= 324;                        // About same speed
a = (a<<2) + (a<<7) + (a<<9);    // About same speed

a *= 340;                                 // Faster
a = (a<<2) + (a<<4) + (a<<7) + (a<<9);    // Slower

On a little microcontroller like a PIC18 with single cycle multiply, but no barrel shifter, multiplication is faster if you’re shifting by more than 1 bit.

a  *= 2;   // Exactly the same speed
a <<= 1;   // Exactly the same speed

a  *= 4;   // Faster
a <<= 2;   // Slower

Note that that’s the opposite of what was true on older Intel CPUs.

But it’s still not that simple. If I remember correctly, due to its Superscalar architecture, a Pentium was able to process either one multiply instruction or two shift instructions simultaneously (as long as they weren’t dependent on each other). This means that if you wanted to multiply two variables by a power of 2, then shifting might be better.

a  *= 4;   // 
b  *= 4;   // 

a <<= 2;   // Both lines execute in a single cycle
b <<= 2;   // 

1

You’ve got several problems with your test program.

First, you’re not actually using the value of test. There is no way, within the C standard, that the value of test matters. The optimizer is this completely free to remove it. Once its removed it, your loop is actually empty. The only visible effect would be to set runs = 100000000, but runs also isn’t used. So the optimizer can (and should!) remove the entire loop. Easy fix: also print the computed value. Note that a sufficiently determined optimizer could still optimize away the loop (it relies entirely on constants known at compile time).

Second, you do two operations that cancel each other out. The optimizer is allowed to notice this and cancel them out. Again leaving an empty loop, and removed. This one is downright hard to fix. You can switch to an unsigned int (so overflow isn’t undefined behavior), but that of course just results in 0. And simple things (like, say, test += 1) are easy enough for the optimizer to figure out, and it does.

Finally, you assume that test *= 2 is actually going to be compiled to a multiply. That’s a very simple optimization; if the bitshift is faster, the optimizer will use it instead. To get around this, you’d have to use something like an implementation-specific assembly inline.

Or, I suppose, just check your microprocessor data sheet to see which is faster.

When I checked the assembly output of compiling your program with gcc -S -O3 using version 4.9, the optimizer actually saw through every simple variation above, and several more. In all cases, it removed the loop (assigning a constant to test), the only thing left was the calls to clock(), the convert/subtract, and the printf.

4

I think it would be more helpful for the questioner to have a more differentiated answer, because I see several unexamined assumptions in the questions and in some of the answers or comments.

The resulting relative runtime of shifting and multiplication has nothing to do with C. When I say C, I do not mean the instance of a specific implementation, such as that or that version of GCC, but the language. I do not mean to take this ad absurdum, but to use an extreme example for illustration: you could implement a completely standards compliant C compiler and have multiplication take an hour, while shifting takes milliseconds – or the other way around. I am not aware of any such performance restrictions in C or C++.

You may not care about this technicality in argumentation. Your intention was probably to just test out the relative performance of doing shifts versus multiplications and you chose C, because it is generally perceived as a low level programming language, so one may expect its source code to translate into corresponding instructions more directly. Such questions are very common and I think a good answer should point out that even in C your source code does not translate into instructions as directly as you may think in a given instance. I have given you some possible compilation results below.

This is where comments that question the usefulness of substituting this equivalence in real-world software come in. You can see some in the comments to your question, such as the one from Eric Lippert. It is in line with the reaction you will generally get from more seasoned engineers in response to such optimizations. If you use binary shifts in production code as a blanket means of multiplying and dividing, people will most likely cringe at your code and have some degree of emotional reaction (“I have heard this nonsensical claim made about JavaScript for heaven’s sake.”) to it that may not make sense to novice programmers, unless they better understand the reasons for those reactions.

Those reasons are primarily a combination of the decreased readability and the futility of such optimization, as you may have found out already by comparing their relative performance. However, I do not think that people would have as strong a reaction if substitution of shift for multiplication were the only example of such optimizations. Questions like yours frequently come up in various forms and in various contexts. I think what more senior engineers actually react to so strongly, at least I have at times, is that there is potential for a much wider range of harm when people employ such micro-optimizations liberally across a code base.

If you work at a company like Microsoft on a large code base, you will spend a lot of time reading other engineers’ source code, or attempting to locate certain code therein. It may even be your own code that you will be trying to make sense of in a few years’ time, particularly at some of the most inopportune moments, such as when you have to fix a production outage following a call you received while on pager duty on a Friday night, about to head out for a night of fun with friends. If you spend that much time reading code, you will appreciate it being as readable as possible. Imagine reading your favorite novel, but the publisher has decided to release a new edition where they use abbrv. all ovr th plc bcs thy thnk it svs spc. That is akin to the reaction other engineers may have to your code if you sprinkle it with such optimizations. As other answers have pointed out, it is better to clearly state what you mean, which is to use the operation that is semantically closest to the idea you are trying to express.

Even in those environments, though, you may find yourself solving an interview question where you are expected to know this or some other equivalence. Knowing them is not bad, and a good engineer would be aware of the arithmetic effect of binary shifting. Note that I did not say that this makes a good engineer, but that a good engineer would know it, in my opinion. In particular, you may still find some manager, usually towards the end of your interview loop, who will grin broadly at you in anticipation of the delight of revealing this smart engineering “trick” in a coding question and proving that he/she, too, used to be or still is one of the savvy engineers and not “just” a manager. In those situations, just try to look impressed and thank him/her for the enlightening interview.

Why did you not see a speed difference in C? The most likely answer is that both operations resulted in the same assembly code:

int shift(int i) { return i << 2; }
int multiply(int i) { return i * 4; }

Can both compile into

shift(int):
    lea eax, [0+rdi*4]
    ret

On GCC without optimizations, i.e. using the flag “-O0”, you may get this:

shift(int):
    push    rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    sal eax, 2
    pop rbp
    ret
multiply(int):
    push    rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    sal eax, 2
    pop rbp
    ret

As you can see, passing “-O0” to GCC does not mean that it will not be somewhat smart about what kind of code it produces. In particular, notice that even in this case the compiler avoided the use of a multiply instruction. You can repeat the same experiment with shifts by other numbers and even multiplications by numbers that are not powers of two. Chances are that on your platform you will see a combination of shifts and additions, but no multiplications. It would seem like a bit of a coincidence for the compiler to avoid using multiplications in all those cases if multiplications and shifts really had the same cost, would it not? But I do not mean to supply supposition for proof, so let us move on.

You could rerun your test with the above code and see if you notice a speed difference now. Even then you are not testing shift versus multiply, though, as the absence of a multiplication shows, but rather the code that GCC generated, with a certain set of flags, for the C operations of shift and multiply in this particular instance. So, in another test, you could edit the assembly code by hand and use an “imul” instruction in the code for the “multiply” method instead.

If you wanted to defeat some of those smarts of the compiler, you could define more general shift and multiply functions and end up with something like this:

int shift(int i, int j) { return i << j; }
int multiply(int i, int j) { return i * j; }

Which may yield the following assembly code:

shift(int, int):
    mov eax, edi
    mov ecx, esi
    sal eax, cl
    ret
multiply(int, int):
    mov eax, edi
    imul    eax, esi
    ret

Here we finally have, even on the highest optimization level of GCC 4.9, the expression in assembly instructions you may have expected when you initially set out on your test. I think that in itself can be an important lesson in performance optimization: we can see what a difference it made, in terms of the smarts that the compiler is able to apply, to substitute variables for concrete constants in our code.

Micro-optimizations like the shift-multiply substitution are some very low-level optimizations that a compiler can usually easily do by itself. Other optimizations that are much more impactful on performance require an understanding of the intention of the code, which is often not accessible to the compiler or can only be guessed at by some heuristic. That is where you as a software engineer come in, and it typically does not involve substituting multiplications with shifts. It involves factors such as avoiding a redundant call to a service that produces I/O and can block a process. If you go to your hard disk or, god forbid, to a remote database for some extra data you could have derived from what you already have in memory, the time you spend waiting outweighs the execution of a million instructions.

Now, I think we have strayed a bit far from your original question, but pointing this out to a questioner, especially one who is just starting to get a grasp on the translation and execution of code, can be extremely important to address not just the question itself but many of its (possibly) underlying assumptions.

So, which one will be faster? I think it is a good approach you chose, to actually test out the performance difference. In general, it is easy to be surprised by the runtime performance of some code changes. There are many techniques modern processors employ, and the interactions between software components can be complex, too. Even if you get beneficial performance results for a certain change in one situation, it is dangerous to conclude that this type of change will always yield performance benefits. It is just as dangerous to run such tests once, say “Okay, now I know which one’s faster!”, and then indiscriminately apply that same optimization to production code without repeating your measurements.

So what if the shift is faster than the multiplication? There are certainly indications why that would be true. GCC, as you can see above, appears to think (even without optimization) that avoiding direct multiplication in favor of other instructions is a good idea. The Intel 64 and IA-32 Architectures Optimization Reference Manual will give you an idea of the relative cost of CPU instructions. Another resource, more focused on instruction latency and throughput, is http://www.agner.org/optimize/instruction_tables.pdf. Note that they are not a good predictor of absolute runtime, but of the performance of instructions relative to each other. In a tight loop, as your test is simulating, the metric of “throughput” should be most relevant. It is the number of cycles that an execution unit will typically be tied up for when executing a given instruction.

So what if the shift is NOT faster than the multiplication? As I said above, modern architectures can be quite complex and things such as branch prediction, caching, pipelining, and parallel execution units can make it hard to predict the relative performance of two logically equivalent pieces of code at times. I really want to emphasize this, because this is where I am not happy with most answers to questions like these and with the camp of people outright saying that it is simply not true (anymore) that shifting is faster than multiplication.

No, as far as I am aware we did not invent some secret engineering sauce in the 1970s or whenever to suddenly annul the cost difference of a multiplication unit and a bit shifter. A general multiplication, in terms of logical gates, and certainly in terms of logical operations, is still more complex than a shift with a barrel shifter in many scenarios, on many architectures. How this translates into overall runtime on a desktop computer may be a bit opaque. I do not know for sure how they are implemented in specific processors, but here is an explanation of a multiplication: Is integer multiplication really same speed as addition on modern CPU

And here is an explanation of a barrel shifter. The documents I have referenced in the previous paragraph give another view on the relative cost of operations, by proxy of CPU instructions. The engineers over at Intel frequently seem to get similar questions: intel developer zone forums clock cycles for integer multiplication and addition in core 2 duo processor

Yes, in most real-life scenarios, and almost certainly in JavaScript, attempting to exploit this equivalence for performance’s sake is probably a futile undertaking. However, even if we forced the use of multiply instructions and then saw no difference in runtime, that would be due to the nature of the cost metric we used, not because there is no cost difference. End-to-end runtime is one metric, and if it is the only one we care about, all is well. But that does not mean that all cost differences between multiplication and shifting have simply disappeared. And I think it is certainly not a good idea to convey that idea, by implication or otherwise, to a questioner who is obviously just starting to get an idea of the factors involved in the runtime and cost of modern code.

Engineering is always about trade-offs. Inquiry into, and explanation of, the trade-offs modern processors have made to exhibit the execution time that we as users end up seeing may yield a more differentiated answer. And a more differentiated answer than “this is simply not true anymore” is warranted if we want to see fewer engineers check in micro-optimized code obliterating readability, because it takes a more general understanding of the nature of such “optimizations” to spot their various, diverse incarnations than simply referring to some specific instances as out of date.

What you see is the effect of the optimiser.

The optimiser’s job is to make the resulting compiled code either smaller or faster (but rarely both at the same time… but like many things… IT DEPENDS on what the code is).

In PRINCIPLE, any call to a multiplication library, or, frequently, even use of a hardware multiplier will be slower than just doing a bitwise shift.

So… if a naive compiler generated a call to a library for the operation *2, then of course it would run slower than a bitwise shift.

However, optimisers are there to detect patterns and figure out how to make the code smaller/faster/whatever. And what you have seen is the compiler detecting that *2 is the same as a shift.

Just as a matter of interest, I was looking today at the generated assembler for some operations like *5 (not actually looking for that, but for other things), and along the way I noticed that the compiler had turned *5 into:

  • shift
  • shift
  • add original number

So the optimiser of my compiler was smart enough (at least for certain small constants) to generate inline shifts and adds instead of calls to a general purpose multiply library.

The art of compiler optimisers is a whole separate subject, filled with magic, and really properly understood by about 6 people on the whole planet 🙂

Try timing it with:

for (runs = 0; runs < 100000000; runs++) {
      ;
}

The compiler should be recognizing that the value of test is unchanged after each iteration of the loop, and the final value of test is unused, and eliminating the loop entirely.

Multiplication is a combination of shifts and additions.

In the case you’ve mentioned, I don’t believe it matters whether the compiler optimises it or not – “multiply x by two” can be implemented as either:

  • Shift bits of x one place to the left.
  • Add x to x.

These are each basic atomic operations; one isn’t faster than the other.

Change it to “multiply x by four” (or any 2^k, k>1) and it’s a little different:

  • Shift bits of x two places to the left.
  • Add x to x and call it y; then add y to y.

On a basic architecture, it’s simple to see that the shift is more efficient – taking one vs. two operations, since we cannot add y to y until we know what y is.

Try the latter (or any 2^k, k>1), with appropriate options to prevent your compiler from optimising them to be the same thing in implementation. You should find the shift is faster, taking O(1) compared to the repeated addition’s O(k).

Obviously, where the multiplicand is not a power of two, a combination of shifts and additions (one where the number of each is non-zero) is necessary.


Multiplication of signed or unsigned values by powers of two is equivalent to left-shifting, and most compilers will make the substitution. Division of unsigned values, or signed values which the compiler can prove are never negative, is equivalent to right-shifting, and most compilers will make that substitution (though some aren’t sophisticated enough to prove when signed values can’t be negative).

It should be noted, however, that division of potentially-negative signed values is not equivalent to right-shifting. An expression like (x+8)>>4 is not equivalent to (x+8)/16. The former, in 99% of compilers, will map values from -24 to -9 to -1, -8 to +7 to 0, and +8 to +23 to 1 [rounding numbers almost symmetrically about zero]. The latter will map -39 to -24 to -1, -23 to +7 to 0, and +8 to +23 to +1 [grossly asymmetrical, and likely not what was intended]. Note that even when values are not expected to be negative, use of >>4 will likely yield faster code than /16 unless the compiler can prove the values can’t be negative.

Some more info I just checked out.

On x86_64, the MUL opcode has a 10 cycle latency and 1/2 cycle throughput. MOV, ADD and SHL have a latency of 1 cycle, with 2.5, 2.5, and 1.7 cycle throughput.

A multiplication by 15 done naively with shifts and adds would require 3 SHL and 3 ADD ops (15 = 8 + 4 + 2 + 1) and probably a couple of MOVs, though a compiler can do better, e.g. (x << 4) - x with a single SHL and SUB.

https://gmplib.org/~tege/x86-timing.pdf

Your methodology is flawed. The loop increment and condition checking themselves take a significant share of that time.

  • Try running an empty loop and measure the time (call it base).
  • Now add 1 shift operation and measure the time (call it s1).
  • Next add 10 shift operations and measure the time (call it s2).

If everything is going correctly, (s2 - base) should be about 10 times (s1 - base). Otherwise something else is coming into play here.

Now I actually tried this myself and figured: if loops are causing a problem, why not remove them entirely? So I went ahead and did this:

int main(){

    int test = 2;
    clock_t launch = clock();

    test << 6;
    test << 6;
    test << 6;
    test << 6;
    //.... 1 million times
    test << 6;

    clock_t done = clock();
    printf("Time taken : %ld\n", (long)(done - launch));
    return 0;
}

And there you have your result

1 million shift operations in under 1 millisecond?

I did the same thing for multiplication by 64 and got the same result. So the compiler is probably ignoring the operation completely; as others mentioned, the value of test is never changed.

[Screenshot: shift operator timing result]
