Solutions for floating point rounding errors

In building an application that deals with a lot of mathematical calculations, I have encountered the problem that certain numbers cause rounding errors.

While I understand that floating point is not exact, how do I deal with exact numbers so that floating point rounding doesn’t cause any issues when calculations are performed on them?

There are three fundamental approaches to creating alternative numeric types that are free of floating point rounding. The common theme with these is that they use integer math instead in various ways.

Rationals

Represent the number as a whole part and rational number with a numerator and a denominator. The number 15.589 would be represented as w: 15; n: 589; d:1000.

Adding 0.25 (which is w: 0; n: 1; d: 4) involves calculating the LCM of the two denominators (here 1000) and then adding the numerators: 589/1000 + 250/1000 = 839/1000, giving w: 15; n: 839; d: 1000. This works well for many situations, though it can result in very large numerators and denominators when you are working with many rationals whose denominators are relatively prime to each other.
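As a rough illustration of the rational approach, here is a minimal Python sketch using the standard library's fractions.Fraction. Note that Fraction stores a single numerator/denominator pair rather than the separate whole part described above; that simplification belongs to this sketch, not to the description.

```python
from fractions import Fraction

# 15.589 and 0.25 as exact ratios of integers
a = Fraction(15589, 1000)
b = Fraction(1, 4)

total = a + b          # brought to a common denominator, then reduced
print(total)           # 15839/1000

# Denominators that share no factors multiply together, so they grow quickly.
s = Fraction(1, 7) + Fraction(1, 11) + Fraction(1, 13)
print(s.denominator)   # 1001
```

Three small denominators already give 1001; a long chain of such additions is where the "very large numbers" come from.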

Fixed point

You have the whole part and the decimal part. All numbers are rounded (there’s that word – but you know where it is) to that precision. For example, you could have fixed point with 3 decimal places. 15.589 + 0.250 becomes (589 + 250) % 1000 for the decimal part, with any carry added to the whole part. This works very nicely with existing databases. As mentioned, there is rounding, but you know where it is and can specify it to be more precise than is needed (if you are only measuring to 3 decimal places, make it fixed at 4).
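The carry step can be sketched like this in Python. The names SCALE and add_fixed are made up for this example, and the sketch only handles non-negative values.

```python
SCALE = 1000  # three decimal places (an assumption for this example)

def add_fixed(a, b):
    """Add two (whole, frac) pairs, where frac counts thousandths (0..999)."""
    frac_sum = a[1] + b[1]
    carry, frac = divmod(frac_sum, SCALE)   # carry is 0 or 1
    return (a[0] + b[0] + carry, frac)

print(add_fixed((15, 589), (0, 250)))  # (15, 839)
print(add_fixed((15, 589), (0, 750)))  # (16, 339)
```

In practice it is often simpler to store a single integer count of thousandths (15589) and split it into whole and decimal parts only for display.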

Floating fixed point

Store a value and the precision. 15.589 is stored as 15589 for the value and 3 for the precision, while 0.25 is stored as 25 and 2. This can handle arbitrary precision. I believe this is what the internals of Java’s BigDecimal use (I haven’t looked at it recently). At some point, you will want to get it back out of this format and display it – and that may involve rounding (again, you control where it is).
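A minimal sketch of the value-plus-precision idea, assuming non-negative numbers and relying on Python's unbounded integers. The helper names add_scaled and to_text are invented for illustration and are not taken from any library.

```python
def add_scaled(a, b):
    """Add two (value, precision) pairs, e.g. (15589, 3) means 15.589."""
    (va, pa), (vb, pb) = a, b
    p = max(pa, pb)
    # Rescale both integers to the larger precision before adding.
    return (va * 10 ** (p - pa) + vb * 10 ** (p - pb), p)

def to_text(x):
    """Format a non-negative (value, precision) pair for display."""
    value, precision = x
    if precision == 0:
        return str(value)
    digits = str(value).rjust(precision + 1, "0")
    return digits[:-precision] + "." + digits[-precision:]

total = add_scaled((15589, 3), (25, 2))
print(total)            # (15839, 3)
print(to_text(total))   # 15.839
```

Addition never rounds here; the precision simply grows to whatever the operands need.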

Once you determine the choice for the representation, you can either find existing third party libraries that use this, or write your own. When writing your own, be sure to unit test it and make sure you are doing the math correctly.

If floating point values have rounding problems, and you don’t want to have to run into rounding problems, it logically follows that the only course of action is to not use floating point values.

Now the question becomes, “how do I do math involving non-integer values without floating point variables?” The answer is with arbitrary-precision data types. Calculations are slower because they have to be implemented in software instead of in hardware, but they’re accurate. You didn’t say what language you’re using, so I can’t recommend a package, but there are arbitrary precision libraries available for most popular programming languages.
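Since no language was given, here is a small sketch in Python, chosen purely for illustration, using its standard decimal module as one example of such a data type. The 50-digit precision is an arbitrary choice for the example.

```python
from decimal import Decimal, getcontext

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly.
float_equal = (0.1 + 0.2 == 0.3)
print(float_equal)      # False

# Decimal values built from strings are stored exactly.
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(decimal_equal)    # True

# Division still has to round somewhere, but you choose where.
getcontext().prec = 50
third = Decimal(1) / Decimal(3)
print(third)            # 0. followed by fifty 3s
```

The last lines show the limit of the approach: results such as 1/3 have no finite decimal expansion, so "accurate" means accurate to the precision you configured.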

Floating point arithmetic is usually quite precise (15 decimal digits for a double) and quite flexible. The problems crop up when you are doing math that significantly reduces the number of digits of precision. Here are some examples:

Cancellation on subtraction: 1234567890.12345 - 1234567890.12300; the result 0.00045 has only two decimal digits of precision. This strikes whenever you subtract two numbers of similar magnitude.
Swallowing of precision: 1234567890.12345 + 0.123456789012345 evaluates to 1234567890.24691; the last ten digits of the second operand are lost.
Multiplications: If you multiply two 15 digit numbers, the result has 30 digits that need to be stored. But you can’t store them, so the last 15 digits are lost. This is especially irksome when combined with a sqrt() (as in sqrt(x*x + y*y)): The result will only have 7.5 digits of precision.
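The first two effects are easy to reproduce. A minimal Python sketch (Python's float is a double); the variable names are only for this example:

```python
a = 1234567890.12345
b = 1234567890.12300

# Cancellation: the exact answer would be 0.00045.
diff = a - b
print(diff)                        # only the first few digits are trustworthy
cancel_error = abs(diff - 0.00045)

# Swallowing: find out how much of `small` actually got added to `a`.
small = 0.123456789012345
recovered = (a + small) - a
print(recovered)                   # agrees with `small` only in the leading digits
swallow_error = abs(recovered - small)
```

Doubles near 1.2 billion are spaced about 2.4e-7 apart, so both errors land around 1e-7, far larger than the usual 1e-16 relative error would suggest.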

These are the main pitfalls that you need to be aware of. And once you are aware of them, you can try to formulate your math in a way that avoids them. For example, if you need to increment a value over and over again in a loop, avoid doing this:

for(double f = f0; f < f1; f += df) {

After a few iterations, the larger f will swallow part of the precision of df. Worse, the errors will add up, leading to the counterintuitive situation that a smaller df may lead to worse overall results. Better to write this:

for(int i = 0; i < (f1 - f0)/df; i++) {
    double f = f0 + i*df;

Because you are combining the increments in a single multiplication, the resulting f will be precise to 15 decimal digits.
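The difference shows up even in a tiny case. A sketch in Python, with ten steps of 0.1 as arbitrary example values:

```python
df = 0.1

# Repeated addition: every += rounds, and the rounding errors accumulate.
f = 0.0
for _ in range(10):
    f += df
print(f)          # 0.9999999999999999

# Single multiplication: only one rounding step.
g = 0.0 + 10 * df
print(g)          # 1.0
```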

This is only an example; there are other ways to avoid loss of precision from other causes. But it already helps a lot to think about the magnitude of the involved values, and to imagine what would happen if you were to do your math with pen and paper, rounding to a fixed number of digits after every step.

How to make sure that you don’t have problems: Learn about floating-point arithmetic problems, or hire someone who knows about them, or use some common sense.

The first problem is precision. In many languages you have “float” and “double” (double standing for “double precision”), and in many cases “float” gives you about 7 digits of precision, while double gives you 15. Common sense is that if you have a situation where precision might be a problem, 15 digits is an awful lot better than 7 digits. In many slightly problematic situations, using “double” means you get away with it, and “float” means you don’t. Let’s say a company’s market cap is 700 billion dollars. Represent this in float, and the lowest bit is $65536. Represent it using double, and the lowest bit is about 0.012 cents. So unless you really, really know what you are doing, you use double, not float.
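To see the market cap example in action, here is a sketch in Python. Python has no built-in 32-bit float, so the helper to_float32 (a name made up here) round-trips a value through the struct module to simulate one.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

cap = 700e9  # 700 billion dollars

# In single precision, adding $10,000 changes nothing at this magnitude.
float_unchanged = (to_float32(cap) == to_float32(cap + 10_000))
print(float_unchanged)     # True

# In double precision, even one cent is still visible.
double_unchanged = (cap + 0.01 == cap)
print(double_unchanged)    # False
```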

The second problem is more a matter of principle. If you do two different calculations that should give the same result, they often don’t because of rounding errors. Two results that should be equal will be “almost equal”. If two results are close together, then the real values might be equal. Or they might not be. You need to keep that in mind and should write and use functions that say “x is definitely greater than y” or “x is definitely less than y” or “x and y might be equal”.
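One possible shape for such a function, sketched in Python. The function name and the tolerance defaults are assumptions for this example; sensible tolerances depend on how much error your particular calculation can have accumulated.

```python
import math

def compare(x, y, rel_tol=1e-9, abs_tol=1e-12):
    """Three-way comparison that admits uncertainty near equality."""
    if math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
        return "might be equal"
    return "definitely greater" if x > y else "definitely less"

print(compare(0.1 + 0.2, 0.3))   # might be equal
print(compare(1.0, 2.0))         # definitely less
print(compare(2.0, 1.0))         # definitely greater
```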

This problem gets a lot worse if you use rounding, for example “round x down to the nearest integer”. If you multiply 120 * 0.05, the result should be 6, but what you get is “some number very close to 6”. If you then “round down to the nearest integer”, that “number very close to 6” might be “slightly less than 6” and get rounded to 5. And note that it doesn’t matter how much precision you have. Doesn’t matter how close to 6 your result is, as long as it is less than 6.
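Here is a concrete case of this trap, sketched in Python. It uses (0.1 + 0.7) * 10 rather than the numbers above, because that expression is known to land just below the exact answer in double precision. The decimal module appears as one possible remedy, not the only one.

```python
import math
from decimal import Decimal

x = (0.1 + 0.7) * 10
print(x)                    # 7.999999999999999
floored_float = math.floor(x)
print(floored_float)        # 7, although the exact answer is 8

# With decimal arithmetic the product is exactly 8, so flooring is safe.
y = (Decimal("0.1") + Decimal("0.7")) * 10
floored_decimal = math.floor(y)
print(floored_decimal)      # 8
```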

And third, some problems are difficult. That means there is no quick and easy rule. If your compiler supports “long double” with more precision, you can use “long double” and see if it makes a difference. If it makes no difference, then either you are OK, or you have a really tricky problem. If it makes the kind of difference that you would expect (like a change at the 12th decimal), then you are likely all right. If it really changes your results, then you have a problem. Ask for help.

Most people make the mistake that when they see double they scream BigDecimal, when in fact
they’ve just moved the problem elsewhere. A double has a sign bit (1 bit), an exponent (11 bits)
and a significand with 53 bits of precision (52 explicitly stored). Due to the nature of double,
the larger the whole-number part, the more absolute accuracy you lose. The formula for
calculating the accuracy is below.

The accuracy of a double at a given magnitude follows from this formula, where epsilon is
the distance from X to the next representable number:
2^E <= abs(X) < 2^(E+1)

epsilon = 2^(E-52) % For a 64-bit float (double precision)

 Accuracy power | Accuracy +/- | Magnitude 2^E | Value of 2^E
 2^-1           | 0.5          | 2^51          | 2.2518E+15
 2^-5           | 0.03125      | 2^47          | 1.40737E+14
 2^-10          | 0.000976563  | 2^42          | 4.39805E+12
 2^-15          | 3.05176E-05  | 2^37          | 1.37439E+11
 2^-20          | 9.53674E-07  | 2^32          | 4294967296
 2^-25          | 2.98023E-08  | 2^27          | 134217728
 2^-30          | 9.31323E-10  | 2^22          | 4194304
 2^-35          | 2.91038E-11  | 2^17          | 131072
 2^-40          | 9.09495E-13  | 2^12          | 4096
 2^-45          | 2.84217E-14  | 2^7           | 128
 2^-50          | 8.88178E-16  | 2^2           | 4

In other words:
If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^52.
Any larger than this and the distance between floating point numbers is greater than 0.5.

If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be
is 2^42. Any larger than this and the distance between floating point numbers is greater than 0.0005.
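The spacing at any magnitude can be computed directly. A small Python sketch, assuming a normal, non-zero double; the function name spacing and its stored_bits parameter are made up for this example (pass 23 for a 32-bit float or 10 for a 16-bit float):

```python
import math

def spacing(x, stored_bits=52):
    """Gap between adjacent floats near x: 2^(E - stored_bits),
    where 2^E <= abs(x) < 2^(E+1)."""
    _, e = math.frexp(abs(x))   # abs(x) = m * 2**e with 0.5 <= m < 1
    return 2.0 ** ((e - 1) - stored_bits)

print(spacing(2.0 ** 51))   # 0.5
print(spacing(2.0 ** 42))   # 0.0009765625, i.e. 2^-10
print(spacing(100.0))       # 2^-46, roughly 1.4e-14
```

On Python 3.9 and later, math.ulp(x) should give the same value for doubles without the hand-rolled formula.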

I cannot really give a better answer than this. The user will need to figure out what precision they want when performing the necessary calculations, and in which unit (meters, feet, inches, mm, cm). For the vast majority of cases, float will suffice for simple simulations, depending on the scale of the world you’re aiming to simulate.

That said, if you’re only aiming to simulate a 100 meter by 100 meter world, a double gives you an accuracy on the order of 2^-45. This is not even going into how the FPUs inside modern CPUs may do calculations at a width larger than the native type size, and only round (depending on the FPU rounding mode) to the native type size after the calculation is complete.

Filed under: softwareengineering - @ 01:22

Tags: floating-point, numeric-precision
