I am having trouble understanding a specific IEEE double computation. Take the following C99 program, which runs on a host with IEEE doubles (8 bytes: 11-bit biased exponent, 52-bit encoded mantissa):
#include <stdio.h>
#include <fenv.h>

// Expect IEEE double (8 bytes).
int _a[sizeof (double) == 8 ? 1 : -1];

#define PP(X) printf (#X " = %d\n", X)

int main (void)
{
  // Print the rounding-mode macro values.
  PP (FE_TONEAREST);
  PP (FE_UPWARD);
  PP (FE_DOWNWARD);
  PP (FE_TOWARDZERO);

  // What mode do we have?
  printf ("rounding mode = %d\n", fegetround ());

  // Add a and b.
  double a = -0x1.638e38e38e38ep5;
  double b = -0x1.638e38e38e38ep6;

  // Empty asm barriers so the compiler cannot fold a + b at
  // compile time.
  __asm ("" : "+r" (a));
  __asm ("" : "+r" (b));

  printf ("a = %a\n", a);
  printf ("b = %a\n", b);
  printf ("a+b = %a\n", a + b);
  return 0;
}
Compile and run:
$ gcc rounding.c -lm && ./a.out
FE_TONEAREST = 0
FE_UPWARD = 2048
FE_DOWNWARD = 1024
FE_TOWARDZERO = 3072
rounding mode = 0
a = -0x1.638e38e38e38ep+5
b = -0x1.638e38e38e38ep+6
a+b = -0x1.0aaaaaaaaaaaap+7
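For reference, the encoded fields behind those %a values can be dumped directly. Here is a small standalone sketch (the dump helper is mine, not part of the program above):

#include <stdio.h>
#include <string.h>
#include <inttypes.h>

// Sketch: decode the binary64 fields of a double.
static void dump (const char *name, double d)
{
  uint64_t u;
  memcpy (&u, &d, sizeof u);  // get the raw bits; well-defined in C99
  printf ("%s: sign=%u biased-expo=%u mantissa=0x%013" PRIx64 "\n",
          name,
          (unsigned) (u >> 63),
          (unsigned) ((u >> 52) & 0x7ff),
          u & ((UINT64_C (1) << 52) - 1));
}

int main (void)
{
  dump ("a  ", -0x1.638e38e38e38ep5);
  dump ("b  ", -0x1.638e38e38e38ep6);
  dump ("a+b", -0x1.0aaaaaaaaaaaap7);  // the sum printed above
  return 0;
}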
My question is: Why is the least significant nibble of the sum a, and not rounded up to b, given that round-to-nearest is in effect?
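To rule out a surprise rounding mode, I also recomputed the sum with each mode forced explicitly. A quick sketch (volatile stands in for the asm barriers above; strictly this wants #pragma STDC FENV_ACCESS ON, which GCC ignores, so I compile with -frounding-math):

#include <stdio.h>
#include <fenv.h>

int main (void)
{
  const int modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
  for (int i = 0; i < 4; i++)
    {
      fesetround (modes[i]);
      // volatile keeps the compiler from folding a + b at compile
      // time, where the dynamic rounding mode would be ignored.
      volatile double a = -0x1.638e38e38e38ep5;
      volatile double b = -0x1.638e38e38e38ep6;
      printf ("mode %d: a+b = %a\n", modes[i], a + b);
    }
  return 0;
}

Only the round-to-nearest result is in question here; the other modes are just a sanity check.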
In an IEEE double emulation with 56 bits for the decoded mantissa, the computation goes like this:
# Printed double values, same as on the host:
A = -0x1.638e38e38e38ep5
B = -0x1.638e38e38e38ep6
# Internal format with a 56-bit mantissa.  The hex digit after |
# encodes the three extra bits beyond IEEE's 52 (so its lowest
# displayed bit is always 0).
#
A = -0x1.638e38e38e38e|0, expo = 5
B = -0x1.638e38e38e38e|0, expo = 6
A + B = -0x1.0aaaaaaaaaaaa|8, expo = 7
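As a cross-check of that internal sum, here is the same computation done with plain integer arithmetic on the decoded significands (a standalone sketch, not the emulator's code):

#include <stdio.h>
#include <inttypes.h>

int main (void)
{
  // Decoded 53-bit significand shared by A and B (implicit 1 included).
  uint64_t m = UINT64_C (0x1638e38e38e38e);

  // |A| + |B| = m*2^(5-52) + m*2^(6-52) = 3*m * 2^(5-52).
  // 3*m needs only 55 bits, so the sum is exact in a uint64_t.
  uint64_t sum = 3 * m;                  // expect 0x42aaaaaaaaaaaa

  // Normalized, the sum is (sum >> 2) * 2^(7-52); the two bits
  // shifted out tell us where the exact value sits between the
  // two neighbouring doubles.
  uint64_t mant = sum >> 2;              // 53-bit candidate mantissa
  uint64_t rest = sum & 3;               // 0..3; 2 means exactly halfway

  printf ("sum  = 0x%" PRIx64 "\n", sum);
  printf ("mant = 0x%" PRIx64 ", rest = %" PRIu64 "/4\n", mant, rest);
  return 0;
}

rest comes out as 2/4, i.e. the exact sum sits exactly halfway between -0x1.0aaaaaaaaaaaap+7 and -0x1.0aaaaaaaaaaabp+7, matching the |8 above.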
So when A + B gets packed as a double, it should round to
(double) (A + B) = -0x1.0aaaaaaaaaaabp+7
No?