I am programming for the ATtiny13 and I have to do a lot of saturating additions. Trying to optimize them, it seems that avr-gcc just doesn't know how to optimize this pattern at all. All of this was tried with AVR GCC 14.1.0 at -O3. Here's what I tried so far:
#include <stdint.h>

uint8_t saturating_add_1(uint8_t a, uint8_t b) {
    uint8_t temp = a + b;
    if (temp < a)
        return 0xFF;
    return temp;
}
This successfully optimizes on x86; avr-gcc, however, gives us this:
saturating_add_1:
.L__stack_usage = 0
        mov r25,r24
        add r24,r22
        cp r24,r25
        brlo .L1
        ret
.L1:
        ldi r24,lo8(-1)
        ret
Not great, not terrible: it does what we told it to.
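(Just to spell out the semantics I'm after, here is a quick host-side sanity check; the test values are my own and there is nothing AVR-specific about it:)

#include <assert.h>
#include <stdint.h>

uint8_t saturating_add_1(uint8_t a, uint8_t b);  /* as defined above */

int main(void) {
    assert(saturating_add_1(200, 100) == 255);  /* 200 + 100 wraps to 44 < 200, so we saturate */
    assert(saturating_add_1(10, 20) == 30);     /* no overflow, plain sum */
    assert(saturating_add_1(255, 1) == 255);    /* edge case: wraps to 0 */
    return 0;
}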
Let’s try another version that is known to optimize correctly on other architectures:
uint8_t saturating_add_2(uint8_t a, uint8_t b) {
    if (b > 255 - a)
        return 255;
    else return a + b;
}
No, that is even worse:
saturating_add_2:
.L__stack_usage = 0
        ldi r18,lo8(-1)
        ldi r19,0
        sub r18,r24
        sbc r19,__zero_reg__
        cp r18,r22
        cpc r19,__zero_reg__
        brlt .L1
        add r24,r22
        ret
.L1:
        ldi r24,lo8(-1)
        ret
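Presumably the 16-bit sub/sbc/cpc dance comes from the usual integer promotions: 255 - a is evaluated as an int. A variant that casts the subexpression back down keeps the comparison in 8 bits, though I have not checked whether avr-gcc actually emits anything better for it:

uint8_t saturating_add_2b(uint8_t a, uint8_t b) {
    /* the cast keeps the comparison 8-bit; 255 - a never wraps since a <= 255 */
    if (b > (uint8_t)(255 - a))
        return 255;
    return a + b;
}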
Fine, I guess we’re trying compiler builtins.
uint8_t saturating_add_builtin(uint8_t a, uint8_t b) {
    if (__builtin_add_overflow(a, b, &a))
        return 255;
    else return a;
}
saturating_add_builtin:
.L__stack_usage = 0
        add r22,r24
        cp r22,r24
        brlo .L1
        mov r24,r22
        ret
.L1:
        ldi r24,lo8(-1)
        ret
It generates more or less the same assembly as our first try. I expect it not to compare at all, but to use the brcs or brcc instruction (branch if carry set/cleared), since the add already leaves the carry flag set appropriately.
Maybe we can force it?
#include <avr/io.h>  /* for SREG */

uint8_t saturating_add_reg(uint8_t a, uint8_t b) {
    uint8_t temp = a + b;
    if (SREG & 1)   /* bit 0 of SREG is the carry flag */
        return 255;
    return temp;
}
saturating_add_reg:
.L__stack_usage = 0
        add r24,r22
        in __tmp_reg__,__SREG__
        sbrs __tmp_reg__,0
        ret
        ldi r24,lo8(-1)
        ret
This is somewhat better, from 7 instructions down to 6. But avr-gcc trips me up again: why does it use sbrs to skip the ret instead of sbrc to skip the ldi? Am I missing something?
Anyway, I also tried to fix it with inline assembly; however, it is a bit unwieldy:
uint8_t saturating_add_asm_1(uint8_t a, uint8_t b) {
    asm (
        "add %[a], %[b]\n\t"
        "brcc no_overflow_%=\n\t"
        "ldi %[a], 255\n\t"
        "no_overflow_%=:"
        : [a] "+d" (a)   /* "d" because ldi only accepts r16-r31 */
        : [b] "r" (b)
        : "cc"
    );
    return a;
}
This works fine, but the compiler cannot optimize for constants (with subi), which, after all the time I spent on this, hurts on an emotional level. My other try is:
uint8_t saturating_add_asm_2(uint8_t a, uint8_t b) {
    uint8_t temp = a + b;
    asm (
        "brcc no_overflow_%=\n\t"
        "ldi %[temp], 255\n\t"
        "no_overflow_%=:"
        : [temp] "+d" (temp)   /* "d" because ldi only accepts r16-r31 */
        :
        :
    );
    return temp;
}
But this seems like it could break because of compiler code reordering, since nothing tells the compiler that the asm block depends on the carry flag set by the addition right above it. And we cannot make the asm block volatile, because that disables even more optimizations.
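For completeness: one could dispatch on __builtin_constant_p so that compile-time constants go through the plain C version and everything else through the inline-asm one. I have not measured whether that actually buys anything; the names below refer to the functions above and this is only a sketch:

#define SATURATING_ADD(a, b)                  \
    (__builtin_constant_p(b)                  \
        ? saturating_add_1((a), (b))          \
        : saturating_add_asm_1((a), (b)))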
Thus, my questions are these:
Has anyone been able to get avr-gcc to optimize this correctly without inline assembly?
Is there a correct way to write it with inline assembly so that it still optimizes for constants?