Floating Point

IEEE Floating point

 

For example, IEEE 754 with 32-bits:

[1 sign bit][8 exponent bits][23 mantissa bits]

 

A bias of 127 is subtracted from the unsigned stored value of the exponent, so a stored value of 1 is interpreted as -126.

Value = sign × 2^e × mantissa

Where e is the stored value - 127.
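
As an illustration of the layout and formula above, the following minimal sketch splits a single-precision value into its three fields and rebuilds it. The function names decode_float32 and normalized_value are hypothetical, and the rebuild step assumes a normalized number whose mantissa has an implicit leading 1.

```python
import struct

def decode_float32(x):
    """Split x (rounded to single precision) into its IEEE 754 fields."""
    # Pack as a big-endian float32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 sign bit
    stored_exponent = (bits >> 23) & 0xFF  # 8 exponent bits (still biased)
    fraction = bits & 0x7FFFFF             # 23 stored mantissa bits
    return sign, stored_exponent, fraction

def normalized_value(sign, stored_exponent, fraction):
    """Rebuild a normalized number: sign x 2^e x mantissa, with e = stored - 127."""
    e = stored_exponent - 127
    mantissa = 1 + fraction / 2**23        # assumption: implicit leading 1
    return (-1) ** sign * 2.0**e * mantissa

fields = decode_float32(-6.5)
print(fields)
print(normalized_value(*fields))
```

Here -6.5 is -1.625 × 2^2, so the stored exponent is 2 + 127 = 129.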

 

Definitions

Emax and Emin represent the maximum and minimum values of e (the exponent) respectively. In IEEE 754 single precision, emax is 127 and emin is -126.

 

The mantissa's leftmost digit mustn't be zero; ensuring this is called normalization. A normalized mantissa always has its point in the same place, so you don't need to record the position of the point (unlike with a decimal point in the standard decimal system).

 

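Denormalized numbers represent very small numbers: those too close to zero to be normalized. They are stored with an exponent field of 0, and their leftmost mantissa bit is taken to be 0 rather than 1.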

 

When a number is normalized, its leftmost mantissa bit is always going to be 1, so it doesn't have to be stored. This is known as the hidden bit.

 

NaNs (Not a Number) are used when an arithmetic operation has no result that can be returned, e.g. the square root of a negative number or the result of invalid input.

 

 

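Type                  Exponent        Mantissa
Zeroes                0               0
Denormalized numbers  0               non-zero
Normalized numbers    1 to 2^e − 2    any
Infinities            2^e − 1         0
NaNs                  2^e − 1         non-zero

In this table e is the number of exponent bits (8 in single precision), and the Exponent column gives the stored, unsigned value of the exponent field.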
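
The classification in the table can be sketched for single precision as follows. The function name classify_float32 is hypothetical; it only inspects the 8-bit exponent field and the 23-bit mantissa field.

```python
import struct

def classify_float32(x):
    """Classify x by the exponent and mantissa fields of its float32 encoding."""
    # Note: struct.pack raises OverflowError for finite values too large for float32.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (bits >> 23) & 0xFF   # stored exponent, 0 to 255
    mantissa = bits & 0x7FFFFF       # stored mantissa bits
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 0xFF:             # all ones, i.e. 2^8 - 1
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"              # stored exponent 1 to 2^8 - 2

for value in (0.0, 2.0**-149, 1.0, float("inf"), float("nan")):
    print(value, classify_float32(value))
```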

Errors

It is generally more useful to say Π = Π* + Error, where Π* is an approximation of Π, than to say Π ≈ Π*, because the first form states how large the discrepancy is.

Absolute Error

Given a value a and an approximation of it, b, the absolute error is:

|a - b|, i.e. the magnitude of the difference between the actual value and the approximation.

Relative Error

The relative error is the absolute error divided by the actual value.

Let the true value of a quantity be a and the measured or inferred value b. Then the relative error is defined by

Erel = Eabs / |a| = |a - b| / |a|

where Eabs = |a - b| is the absolute error. To first order, the relative error of the quotient or product of a number of quantities is less than or equal to the sum of their relative errors. The percentage error is 100% times the relative error.

Example

What are the absolute and relative errors of the approximation 3.14 to the value π?

Eabs = |3.14 - π| ≈ 0.0016
Erel = |3.14 - π|/|π| ≈ 0.00051
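
The same calculation can be reproduced with a short sketch. The function names absolute_error and relative_error are hypothetical, and math.pi stands in for the true value of π, so the results are themselves subject to rounding.

```python
import math

def absolute_error(actual, approx):
    """|actual - approx|"""
    return abs(actual - approx)

def relative_error(actual, approx):
    """Absolute error divided by the magnitude of the actual value."""
    return abs(actual - approx) / abs(actual)

e_abs = absolute_error(math.pi, 3.14)
e_rel = relative_error(math.pi, 3.14)
print(f"Eabs = {e_abs:.4f}")   # Eabs = 0.0016
print(f"Erel = {e_rel:.5f}")   # Erel = 0.00051
```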

 

Rounding Errors

The error introduced by using finite-precision arithmetic during a computation: not every real number can be represented, so results are rounded to a nearby representable value.

Truncation Errors

The error introduced by stopping an infinite process after a finite number of steps.
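
As one illustration of truncation error, the sketch below cuts off the infinite series e^x = 1 + x + x^2/2! + ... after a fixed number of terms. The function name exp_partial_sum and the choice of series are assumptions made for the example.

```python
import math

def exp_partial_sum(x, terms):
    """Sum the first `terms` terms of the Taylor series of e^x about 0."""
    total = 0.0
    term = 1.0                 # x^0 / 0!
    for k in range(terms):
        total += term
        term *= x / (k + 1)    # next term: x^(k+1) / (k+1)!
    return total

# The truncation error shrinks as more terms of the series are kept.
for terms in (2, 5, 10):
    approx = exp_partial_sum(1.0, terms)
    print(terms, abs(math.e - approx))
```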

Loss of Significance

Gradual loss of significance occurs when errors propagate through a computation, so that whilst, say, 16 significant figures are stored, only, say, 8 of them may be accurate.
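
A small sketch of how rounding errors can accumulate: 0.1 has no exact binary representation, so repeatedly adding it drifts away from the exact total. The comparison against math.fsum, which tracks the lost low-order bits, is an addition for illustration only.

```python
import math

values = [0.1] * 1_000_000

# Naive accumulation: each addition rounds, and the errors build up.
naive = 0.0
for v in values:
    naive += v

# math.fsum returns the correctly rounded sum of the stored values.
accurate = math.fsum(values)

print(abs(naive - 100000.0))
print(abs(accurate - 100000.0))
```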

 

Machine Epsilon

Machine epsilon represents the "distance" between adjacent floating point numbers near 1, and so bounds the relative spacing of normalized numbers. It is defined in ISO C as the difference between 1 and the smallest representable number greater than 1, i.e. 2^-23 in single precision and 2^-52 in double.
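
The ISO C definition can be checked with a short sketch. It assumes a Python build whose float is an IEEE 754 double (the usual case), where the loop yields 2^-52; the single-precision value is obtained by stepping the bit pattern of 1.0 to the next float32. The names machine_epsilon and eps32 are hypothetical.

```python
import struct
import sys

def machine_epsilon():
    """Halve eps until 1 + eps/2 is no longer distinguishable from 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

# Single precision: add 1 to the bit pattern of 1.0 to get the next float32.
(one_bits,) = struct.unpack(">I", struct.pack(">f", 1.0))
(next_after_one,) = struct.unpack(">f", struct.pack(">I", one_bits + 1))
eps32 = next_after_one - 1.0

print(machine_epsilon() == sys.float_info.epsilon)
print(eps32 == 2.0**-23)
```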

 

Taylor Errors

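When a derivative is approximated by a finite difference with step size h, obtained by truncating a Taylor series, the rounding error is of order mach_eps/h, so it grows as h shrinks.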
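
The effect of a mach_eps/h rounding term can be seen in a sketch that assumes the setting is a forward-difference approximation of a derivative. The test function sin, the point x = 1 and the helper names are all choices made for illustration. For large h the truncation error dominates; for very small h the rounding error does.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

def derivative_error(h):
    # The exact derivative of sin at x = 1 is cos(1).
    return abs(forward_difference(math.sin, 1.0, h) - math.cos(1.0))

for exponent in (1, 4, 8, 12, 15):
    h = 10.0 ** -exponent
    print(f"h = 1e-{exponent:02d}  error = {derivative_error(h):.3e}")
```

The error is smallest for an intermediate h and grows again as h approaches mach_eps.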

 

Interval Arithmetic

Rather than representing a value with one floating point number, interval arithmetic uses two: a lower bound (min) and an upper bound (max) that together enclose the true value.
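
A minimal sketch of the idea, supporting only addition and multiplication. The Interval class is hypothetical, and widening each bound by one step with math.nextafter (Python 3.9+) is a simple stand-in for proper directed rounding.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float  # lower bound (min)
    hi: float  # upper bound (max)

    def __add__(self, other):
        # Widen outward so rounding cannot push the true sum outside the bounds.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        # The extremes of a product lie among the four endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(products), -math.inf),
                        math.nextafter(max(products), math.inf))

    def __contains__(self, x):
        return self.lo <= x <= self.hi

# An interval enclosing the real number 0.1, which is not exactly representable.
tenth = Interval(math.nextafter(0.1, 0.0), math.nextafter(0.1, 1.0))
total = tenth
for _ in range(9):
    total = total + tenth
print(total, 1.0 in total)
```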

Arbitrary Precision Floating Points

Adaptive precision.
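
As a rough illustration of precision that is chosen at run time rather than fixed by the hardware format, the sketch below uses Python's standard decimal module. This is one possible approach rather than a specific library implied by these notes, and the function name sqrt2 is hypothetical.

```python
from decimal import Decimal, localcontext

def sqrt2(digits):
    """Square root of 2 computed to `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits          # precision applies only inside this block
        return Decimal(2).sqrt()

print(sqrt2(10))
print(sqrt2(30))
```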