IEEE 754 Floating Point
Abstract
The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation.
32-bit Floating Point
A 32-bit floating point number is stored as:
[1 x Sign bit][8 x Exponent bits][23 x Mantissa bits]
The exponent is stored with 127 added to it (the bias), so a stored value of 1 is interpreted as -126.
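The actual value of a normalized floating point number is found through the formula:
Value = Sign × 2^e × Mantissa
where e is the stored exponent minus 127, Sign is +1 or -1 according to the sign bit, and Mantissa is 1.Fraction, that is, the 23 stored bits preceded by an implicit leading 1.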
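As a rough illustration of the formula, the following Python sketch splits a value into its three fields and recombines them. The helper names `decode_float32` and `value_of_normalized` are made up for this example, and the sketch assumes a normalized number (stored exponent from 1 to 254).

```python
import struct

def decode_float32(x):
    """Round x to 32-bit precision and return (sign bit, stored exponent, fraction bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    stored_exponent = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF             # 23 bits
    return sign, stored_exponent, fraction

def value_of_normalized(sign, stored_exponent, fraction):
    """Apply Value = Sign x 2^e x Mantissa, assuming a normalized number."""
    e = stored_exponent - 127              # remove the bias
    mantissa = 1 + fraction / 2**23        # implicit leading 1, i.e. 1.Fraction
    return (-1) ** sign * 2.0 ** e * mantissa

fields = decode_float32(-6.25)
print(fields)                        # (1, 129, 4718592)
print(value_of_normalized(*fields))  # -6.25
```

Here -6.25 is -1.5625 × 2^2, so the stored exponent is 2 + 127 = 129.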
Cases
| Type | Stored exponent | Mantissa |
|---|---|---|
| Zeroes | 0 | 0 |
| Denormalized numbers | 0 | non zero |
| Normalized numbers | 1 to 2^8 - 2 (254) | any |
| Infinities | 2^8 - 1 (255) | 0 |
| NaNs | 2^8 - 1 (255) | non zero |
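The cases in the table can be sketched as a small classifier over the raw 32-bit pattern. This is only an illustration; the function name `classify_float32` is hypothetical and not part of any library.

```python
def classify_float32(bits):
    """Classify a 32-bit pattern by its stored exponent and mantissa fields."""
    stored_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if stored_exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if stored_exponent == 0xFF:  # all exponent bits set (255)
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"          # stored exponent 1 to 254

for pattern in (0x00000000, 0x80000000, 0x00400000,
                0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{pattern:#010x} {classify_float32(pattern)}")
```

The sign bit plays no part in the classification, so 0x80000000 (negative zero) is reported as a zero as well.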
Denormal Numbers
Denormalized numbers are the same as normal floating point numbers except that e = -126 and m is 0.Fraction. (e is NOT -127: the significand has to be shifted to the right by one more bit in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to -126 for the calculation.)
An example of a denormal number in 32-bit IEEE 754 is 0 00000000 10000000000000000000000 (sign, exponent, mantissa), which is 2^-127, or about 5.9×10^-39 in decimal.
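The example can be checked with a short Python sketch that applies the denormal rule (e = -126, m = 0.Fraction) by hand and compares the result with what the `struct` module decodes from the same bits. The pattern is written here as the hexadecimal constant 0x00400000, on the assumption that the example is a zero sign bit, a zero exponent field, and only the top mantissa bit set.

```python
import struct

bits = 0x00400000                     # sign 0, exponent 00000000, mantissa 100...0
fraction = bits & 0x7FFFFF            # 23 mantissa bits
m = fraction / 2**23                  # 0.Fraction, no implicit leading 1
value = 2.0 ** -126 * m               # exponent fixed at -126 for denormals

# Let the struct module decode the same bit pattern as a 32-bit float.
(decoded,) = struct.unpack(">f", struct.pack(">I", bits))

print(m)                 # 0.5
print(value == decoded)  # True
print(value)             # roughly 5.9e-39
```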
FLT_MIN from <float.h> gives 1.17549435e-38f as the smallest positive normalized number that can be expressed in 32-bit floating point; denormal numbers such as the example above are smaller still.
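To relate the FLT_MIN value to the bit layout, the sketch below decodes two patterns: the smallest normalized one (stored exponent 1, mantissa 0) and the smallest denormalized one (stored exponent 0, mantissa 1). The helper `float32_from_bits` is a name invented for this example.

```python
import struct

def float32_from_bits(bits):
    """Interpret a 32-bit integer pattern as an IEEE 754 single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

smallest_normal = float32_from_bits(0x00800000)    # stored exponent 1, mantissa 0
smallest_denormal = float32_from_bits(0x00000001)  # stored exponent 0, mantissa 1

print(smallest_normal == 2.0 ** -126)    # True; this is the FLT_MIN value
print(smallest_denormal == 2.0 ** -149)  # True; 2^-126 x 2^-23
```

The value 1.17549435e-38 is therefore the smallest positive normalized number, while the smallest positive denormalized number is about 1.4×10^-45.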
Revision
IEEE 754r was the working name for the revision of the IEEE 754 floating point standard that was published as IEEE 754-2008.
The intent of the revision was to extend the standard where it had become necessary, to tighten up certain areas of the original standard which were left undefined, and to merge in IEEE 854 (the radix-independent floating-point standard).
The round-to-nearest, ties away from zero rounding mode has been added for decimal operations, as have min and max operations.