Floating Point
For example, IEEE 754 single precision uses 32 bits:
[1 × sign bit][8 × exponent bits][23 × mantissa bits]
A bias of 127 is subtracted from the unsigned stored value of the exponent, so a stored value of 1 is interpreted as -126.
Value = sign × 2^e × mantissa
Where e is the stored value - 127.
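The layout and bias described above can be inspected directly. The following is a minimal sketch using Python's standard `struct` module; the helper name `float32_fields` is made up for this illustration.

```python
import struct

def float32_fields(x):
    """Round x to single precision and split it into its three bit fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 sign bit
    stored_exponent = (bits >> 23) & 0xFF  # 8 exponent bits, still biased
    stored_mantissa = bits & 0x7FFFFF      # 23 mantissa bits
    return sign, stored_exponent, stored_mantissa

sign, stored, mantissa = float32_fields(-6.5)
e = stored - 127  # remove the bias to get the exponent e
print(sign, stored, e, bin(mantissa))
```

For -6.5, which is -1.625 × 2^2, the stored exponent is 129 and so e is 2.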
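Emax and Emin are the maximum and minimum values of the exponent e respectively. In IEEE 754 single precision, Emax is 127 and Emin is -126.
The mantissa's leftmost digit must not be zero; ensuring this is called normalization. Once a number is normalized, the point always sits directly after the leading digit, so its position does not need to be recorded separately (unlike the decimal point in ordinary decimal notation).
Denormalized numbers represent values too small in magnitude to be normalized: the stored exponent is 0, the exponent is fixed at Emin, and the leading mantissa bit is 0 instead of 1.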
When a number is normalized, its leftmost mantissa bit is always going to be 1, so it doesn't have to be stored. This is known as the hidden bit.
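To make the hidden bit concrete, here is a hedged sketch that rebuilds a single-precision value from its raw 32-bit pattern. The function name `decode_float32` is hypothetical, and infinities and NaNs are deliberately left out.

```python
import struct

def decode_float32(bits):
    """Rebuild the value of a 32-bit pattern by hand (finite values only)."""
    sign = -1.0 if bits >> 31 else 1.0
    stored_exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / 2**23  # the 23 stored bits as a fraction in [0, 1)
    if stored_exponent == 255:
        raise ValueError("infinity or NaN, not handled in this sketch")
    if stored_exponent == 0:
        # Denormalized: no hidden bit, exponent fixed at -126.
        return sign * 2.0**-126 * fraction
    # Normalized: the hidden leading 1 is added back in front of the fraction.
    return sign * 2.0**(stored_exponent - 127) * (1.0 + fraction)

print(decode_float32(0x3F800000))  # 1.0
print(decode_float32(0xC0D00000))  # -6.5
print(decode_float32(0x00000001) == 2.0**-149)  # True: smallest denormalized value
```

The results agree with what `struct.unpack(">f", ...)` returns for the same bit patterns.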
NaNs (Not a Number) are used when an arithmetic result cannot be returned, e.g. the square root of a negative number or the result of invalid input.
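A short sketch of how NaNs behave in practice, using Python floats (which are IEEE 754 doubles). Note that Python's `math.sqrt(-1.0)` raises `ValueError` instead of returning a NaN, so the example produces NaNs from invalid operations on infinities.

```python
import math

inf = float("inf")
nan = inf - inf  # invalid operation, so the result is NaN

print(math.isnan(nan))        # True
print(nan == nan)             # False: a NaN compares unequal even to itself
print(math.isnan(0.0 * inf))  # True: another invalid operation
print(math.isnan(nan + 1.0))  # True: NaNs propagate through later arithmetic
```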
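| Type | Stored exponent | Mantissa |
| --- | --- | --- |
| Zeroes | 0 | 0 |
| Denormalized numbers | 0 | non-zero |
| Normalized numbers | 1 to 2^k − 2 | any |
| Infinities | 2^k − 1 | 0 |
| NaNs | 2^k − 1 | non-zero |
Here k is the number of exponent bits (8 in single precision, so 2^k − 1 = 255).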
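The table translates directly into a classifier for single-precision bit patterns. This is a sketch under the assumption of 8 exponent bits and 23 mantissa bits; `classify_float32` is a name invented for the example.

```python
def classify_float32(bits):
    """Classify a 32-bit pattern by its stored exponent and mantissa fields."""
    stored_exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if stored_exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if stored_exponent == 0xFF:  # all exponent bits set, i.e. 255
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"          # stored exponent between 1 and 254

for pattern in (0x00000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{pattern:#010x}", classify_float32(pattern))
```

The sign bit is ignored, so both +0 and -0 are reported as zero.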
It is generally more useful to say π = π* + error than to say π ≈ π*.
Given a value a and an approximation of it, b, the absolute error is:
|a - b|, i.e. the difference between the actual value and the approximate value.
The relative error is the absolute error divided by the actual value.
Let the true value of a quantity be a and the measured or inferred value b. Then the relative error is defined by
Erel = |a - b| / |a| = Eabs / |a|
where Eabs = |a - b| is the absolute error. To first order, the relative error of the quotient or product of a number of quantities is less than or equal to the sum of their relative errors. The percentage error is 100% times the relative error.
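The bound for a product can be seen from a short derivation. This is a sketch with symbols introduced here for illustration: two quantities with true values x and y are measured as x(1 + δ_x) and y(1 + δ_y), where δ_x and δ_y are their signed relative errors.

$$
\frac{x(1+\delta_x)\,y(1+\delta_y) - xy}{xy} = \delta_x + \delta_y + \delta_x\delta_y
$$

$$
\left|\delta_x + \delta_y + \delta_x\delta_y\right| \le |\delta_x| + |\delta_y| + |\delta_x\delta_y|
$$

When both errors are small, the cross term δ_xδ_y is negligible, which leaves the sum |δ_x| + |δ_y| as the bound.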
What are the absolute and relative errors of the approximation 3.14 to the value π?
Eabs = |3.14 - π| ≈ 0.0016
Erel = |3.14 - π|/|π| ≈ 0.00051
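The worked example can be reproduced with two small helper functions. Their names are chosen for this sketch, and `math.pi` stands in for the true value of π.

```python
import math

def absolute_error(actual, approx):
    """|actual - approx|"""
    return abs(actual - approx)

def relative_error(actual, approx):
    """Absolute error divided by the magnitude of the actual value."""
    return abs(actual - approx) / abs(actual)

print(f"{absolute_error(math.pi, 3.14):.4f}")  # 0.0016
print(f"{relative_error(math.pi, 3.14):.5f}")  # 0.00051
```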
Rounding error is the error we get by using finite-precision arithmetic during a computation, as it can't represent all real numbers.
Truncation error is the error we get by stopping an infinite process after a finite number of steps.
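The two kinds of error above, usually called rounding error and truncation error, can each be shown in a few lines. The sketch below uses the Taylor series of e^x as the infinite process; `exp_series` is a hypothetical helper, not a library function.

```python
import math

# Rounding error: 0.1, 0.2 and 0.3 have no exact binary representation.
print(0.1 + 0.2 == 0.3)  # False

def exp_series(x, terms):
    """Sum the first `terms` terms of the Taylor series for e**x."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)  # next term: x**(n+1) / (n+1)!
    return total

# Truncation error: stopping the series early leaves out the remaining terms.
for terms in (5, 10, 15):
    print(terms, abs(math.e - exp_series(1.0, terms)))
```

The printed error shrinks as more terms are kept, until it reaches the level of the rounding error.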
Gradual loss of significance is when an error propagates through a computation, so that whilst, say, 16 significant figures are stored, only, say, 8 of them may be accurate.
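One way to watch errors accumulate is to add the same inexact value many times. This sketch compares a naive running sum with `math.fsum`, which tracks the lost low-order bits; the choice of one million copies of 0.1 is arbitrary.

```python
import math

values = [0.1] * 1_000_000

naive = 0.0
for v in values:
    naive += v  # each addition rounds, and the rounding errors build up

accurate = math.fsum(values)  # compensated summation from the standard library

print(naive)
print(accurate)
print(abs(naive - accurate))
```

Both results are stored with the same number of digits, but fewer of the digits in the naive sum can be trusted.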
Machine epsilon describes the spacing between adjacent floating-point numbers near 1. It is defined in ISO C as the difference between 1 and the smallest representable number greater than 1, i.e. 2^-23 in single precision and 2^-52 in double precision.
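Machine epsilon can be found experimentally by halving a candidate until adding half of it to 1 no longer changes the result. This is a sketch for Python floats, which are double precision; the function name `machine_epsilon` is made up for the example.

```python
import sys

def machine_epsilon():
    """Gap between 1.0 and the next larger float, found by repeated halving."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = machine_epsilon()
print(eps == sys.float_info.epsilon)  # True
print(eps == 2.0**-52)                # True
print(1.0 + eps / 2 == 1.0)           # True: half the gap rounds back to 1.0
```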
For a finite-difference approximation with step size h, the rounding error is of the order of mach_eps/h.
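One setting where a rounding error of order mach_eps/h appears is a finite-difference derivative with step size h. Reading the note that way is an assumption, as are the test function sin and the point x = 1 used in this sketch.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the forward difference (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

def derivative_error(h):
    """Error of the forward difference for sin at x = 1, whose derivative is cos(1)."""
    return abs(forward_difference(math.sin, 1.0, h) - math.cos(1.0))

for h in (1e-2, 1e-5, 1e-8, 1e-11, 1e-14):
    print(f"h={h:.0e}  error={derivative_error(h):.2e}")
```

Shrinking h first reduces the error, but for very small h the rounding term grows and the error gets worse again.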
Interval arithmetic: rather than using one floating point number, this uses two, one for a maximum and one for a minimum value, so that the true result is bracketed between them.
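The two-number idea, commonly known as interval arithmetic, can be sketched as a small class. This is an illustration and not a full implementation: the `Interval` class is hypothetical, only addition and multiplication are covered, and outward rounding is approximated by stepping one float away with `math.nextafter` (Python 3.9 or later).

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float  # minimum value
    hi: float  # maximum value

    def __add__(self, other):
        # Widen each bound by one floating-point step so the true sum stays inside.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        # The extremes of a product lie among the four products of the endpoints.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(products), -math.inf),
                        math.nextafter(max(products), math.inf))

    def contains(self, x):
        return self.lo <= x <= self.hi

# 0.1 is not exactly representable, so bracket it between its two float neighbours.
tenth = Interval(math.nextafter(0.1, 0.0), math.nextafter(0.1, 1.0))
total = tenth + tenth + tenth
print(total)
print(total.contains(0.3))  # True
```

The width of the resulting interval gives a bound on how much error the computation may have accumulated.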
Adaptive precision.