Floating Point
IEEE Floating point
For example, IEEE 754 with 32 bits: [1 sign bit][8 exponent bits][23 mantissa bits]. 127 is subtracted from the unsigned stored value of the exponent, so a stored value of 1 is interpreted as −126.

value = sign × 2^e × mantissa, where e is the stored value − 127.

emax and emin are the maximum and minimum values of e (the exponent) respectively. In IEEE 754 single precision, emax is 127 and emin is −126.

The mantissa's leftmost digit mustn't be zero; ensuring this is called normalization. By doing this you don't need to record the position of the point separately (unlike the decimal point in the standard decimal system). Denormalized numbers represent very small numbers, with magnitudes below the smallest normalized number.

When a number is normalized, its leftmost mantissa bit is always 1, so it doesn't have to be stored. This is known as the hidden bit.

NaNs (Not a Number) are used when an arithmetic result cannot be returned, e.g. the square root of a negative number or the result of invalid input.

It is generally more useful to say π = π* + error than to say π ≈ π*.

Given a value a and an approximation of it, b, the absolute error is |a − b|, i.e. the difference between the actual and the approximate value. The relative error is the absolute error divided by the magnitude of the actual value, |a − b| / |a|.

What are the absolute and relative errors of the approximation 3.14 to the value π? Eabs = |3.14 − π| ≈ 0.0016

Rounding error is the error we get by using finite arithmetic during a computation, as it can't represent all reals. Truncation error is the error we get by stopping an infinite process after a finite number of steps.

Gradual loss of significance is when an error propagates through a computation, so that whilst, say, 16 significant figures are stored, only, say, 8 of them may actually be accurate.

Machine epsilon describes the spacing between consecutive floating point numbers near 1. It is defined in ISO C as the difference between 1 and the smallest representable number greater than 1, i.e. 2^-23 in single precision and 2^-52 in double precision.

Taylor errors: the rounding error is mach_eps/h.

Interval arithmetic: rather than using one floating point number, this uses two, one for a max and one for a min value.

Arbitrary precision floating points: adaptive precision.
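Definitions

| Type | Exponent | Mantissa |
|---|---|---|
| Zeroes | 0 | 0 |
| Denormalized numbers | 0 | non zero |
| Normalized numbers | 1 to 2^e − 2 | any |
| Infinities | 2^e − 1 | 0 |
| NaNs | 2^e − 1 | non zero |

The Exponent column gives the stored (unsigned) exponent field. In this table e is the number of exponent bits (8 in single precision, so 2^e − 1 = 255).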
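The classification by exponent and mantissa fields can be checked directly on single-precision bit patterns. The following is a minimal sketch using Python's standard `struct` module; the helper names `decode_float32` and `classify` are made up for this illustration.

```python
import struct

def decode_float32(x: float):
    """Round x to single precision and split it into its stored fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 stored (biased) exponent bits
    mantissa = bits & 0x7FFFFF        # 23 stored mantissa bits, hidden bit excluded
    return sign, exponent, mantissa

def classify(exponent: int, mantissa: int, exponent_bits: int = 8) -> str:
    """Name the kind of value from its stored exponent and mantissa fields."""
    all_ones = 2 ** exponent_bits - 1
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"

for value in (0.0, 1.0, -2.5, 1e-45, float("inf"), float("nan")):
    s, e, m = decode_float32(value)
    print(f"{value!r:>6}: sign={s} exponent={e} mantissa={m} -> {classify(e, m)}")
```

For 1.0 the stored exponent is 127, which is 0 once the bias of 127 is subtracted, and the mantissa field is 0 because the leading 1 is the hidden bit.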
Errors
Absolute Error
Relative Error
Let the true value of a quantity be x and the measured or inferred value be x0. Then the relative error is defined by
δx = Δx / |x| = |x0 − x| / |x|,
where Δx = |x0 − x| is the absolute error. To first order, the relative error of the quotient or product of a number of quantities is less than or equal to the sum of their relative errors. The percentage error is 100% times the relative error.
Example
For the approximation 3.14 to the value π: Erel = |3.14 − π| / |π| ≈ 0.00051
Rounding Errors
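As a small illustration, the sketch below measures the rounding error in the double-precision sum 0.1 + 0.2 using the absolute and relative error definitions, and re-checks the 3.14 versus π example. The function names are made up for this illustration. `fractions.Fraction` is used so that the error itself is computed exactly.

```python
import math
from fractions import Fraction

def absolute_error(actual, approx):
    return abs(actual - approx)

def relative_error(actual, approx):
    return absolute_error(actual, approx) / abs(actual)

# The worked example: 3.14 as an approximation of pi.
print(absolute_error(math.pi, 3.14), relative_error(math.pi, 3.14))

# Rounding error: 0.1 and 0.2 are not exactly representable in binary,
# so the computed sum differs slightly from the true value 3/10.
exact = Fraction(3, 10)
computed = Fraction(0.1 + 0.2)   # exact value of the double that was produced
e_abs = absolute_error(exact, computed)
e_rel = relative_error(exact, computed)
print(0.1 + 0.2 == 0.3)          # False
print(float(e_abs), float(e_rel))
```

The relative error of the sum comes out below the double-precision machine epsilon, which is the scale of error to expect from a handful of correctly rounded operations.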
Truncation Errors
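Truncation error can be seen by cutting an infinite series short. The sketch below sums only the first few terms of the Taylor series for e^x and compares the result with `math.exp`; the choice of series and the name `exp_taylor` are assumptions made for this illustration.

```python
import math

def exp_taylor(x: float, terms: int) -> float:
    """Sum the first `terms` terms of the series e^x = sum of x^n / n!."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)   # turn x^n / n! into x^(n+1) / (n+1)!
    return total

for terms in (2, 4, 8, 16):
    approx = exp_taylor(1.0, terms)
    print(terms, approx, abs(math.exp(1.0) - approx))
```

The error shrinks as more terms are kept; what remains at each stage is the part of the infinite process that was never carried out.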
Loss of Significance
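One common way significance is lost is by subtracting two nearly equal numbers, so that small rounding errors in the inputs dominate the result. The sketch below is an illustration chosen for these notes, not taken from them: it computes sqrt(x + 1) − sqrt(x) for large x directly and through an algebraically equivalent form that avoids the subtraction.

```python
import math

def naive(x: float) -> float:
    # Two nearly equal square roots are subtracted, so most leading digits cancel.
    return math.sqrt(x + 1) - math.sqrt(x)

def stable(x: float) -> float:
    # Same quantity, rewritten by multiplying by the conjugate; no cancellation.
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

x = 1e12
print(naive(x))
print(stable(x))
print(abs(naive(x) - stable(x)) / stable(x))   # relative disagreement
```

Both square roots are stored with about 16 significant figures, yet the two results agree with each other in only a few of them.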
Machine Epsilon
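Machine epsilon, as the gap between 1 and the next representable number, can be found experimentally. The sketch below halves a candidate until adding half of it to 1.0 no longer changes the result (Python floats are IEEE 754 doubles on common platforms), and finds the single-precision value by stepping to the next bit pattern after 1.0. The function names are made up for this illustration.

```python
import struct
import sys

def machine_epsilon() -> float:
    """Gap between 1.0 and the next double, found by repeated halving."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

def float32_epsilon() -> float:
    """Gap between 1.0 and the next single-precision number."""
    (bits,) = struct.unpack(">I", struct.pack(">f", 1.0))
    (next_up,) = struct.unpack(">f", struct.pack(">I", bits + 1))
    return next_up - 1.0

print(machine_epsilon() == 2.0 ** -52 == sys.float_info.epsilon)
print(float32_epsilon() == 2.0 ** -23)
```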
Taylor Errors
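The sketch below assumes that h is the step size of a forward-difference derivative approximation obtained from a first-order Taylor expansion; the notes do not define h, so this reading is an assumption. Under it, the truncation error shrinks roughly in proportion to h, while the rounding error grows roughly like mach_eps/h.

```python
import math
import sys

def forward_difference(f, x, h):
    """Approximate f'(x) from the first-order Taylor expansion of f(x + h)."""
    return (f(x + h) - f(x)) / h

eps = sys.float_info.epsilon
x = 1.0
exact = math.cos(x)   # true derivative of sin at x

errors = {}
for h in (1e-2, 1e-5, 1e-8, 1e-11, 1e-14):
    errors[h] = abs(forward_difference(math.sin, x, h) - exact)
    print(f"h={h:.0e}  error={errors[h]:.2e}  eps/h={eps / h:.2e}")
```

Typically the error falls as h decreases, then rises again once the eps/h term dominates, so a moderate step near the square root of machine epsilon tends to work best for this formula.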
Interval Arithmetic
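A minimal sketch of the idea of carrying a min and a max value follows. The `Interval` class is hypothetical, covers only addition and multiplication, and widens each result outward by one representable step using `math.nextafter` (Python 3.9 or later) so that the true result stays enclosed.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float   # lower bound (min value)
    hi: float   # upper bound (max value)

    @classmethod
    def around(cls, x: float) -> "Interval":
        """Smallest interval guaranteed to enclose the real number x was rounded from."""
        return cls(math.nextafter(x, -math.inf), math.nextafter(x, math.inf))

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other: "Interval") -> "Interval":
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(math.nextafter(min(products), -math.inf),
                        math.nextafter(max(products), math.inf))

    def __contains__(self, value) -> bool:
        return self.lo <= value <= self.hi

tenth = Interval.around(0.1)
total = tenth + tenth + tenth
print(total, total.hi - total.lo)
```

The width of the resulting interval gives a guaranteed bound on the accumulated rounding error, at the cost of doing each operation on two numbers.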
Arbitrary Precision Floating Points
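As a rough illustration, Python's standard `decimal` module provides floating point arithmetic whose precision is chosen by the caller. This is user-selected rather than truly adaptive precision, so it only approximates the idea named in these notes. The helper `sqrt2` is made up for this sketch.

```python
from decimal import Decimal, localcontext

def sqrt2(digits: int) -> Decimal:
    """Square root of 2 computed to the requested number of significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits          # precision applies only inside this block
        return Decimal(2).sqrt()

for digits in (10, 30, 60):
    print(digits, sqrt2(digits))
```

An adaptive scheme would go one step further and raise the precision automatically until an error estimate fell below a target.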