Floating Point Arithmetic#

In the following we learn how computers represent floating point numbers. Consider for example the number \(1/3 = 0.33333...\). Computers only have a finite amount of memory. We could try to work directly with fractional expressions such as \(1/3\) and keep track of numerators and denominators. But it turns out that the number of digits in both numerator and denominator tends to grow enormously as a computation progresses, making this approach infeasible.

On the other hand, if we want to store the sequence of digits \(0.33333...\), we have to cut it off at some point, again because memory is finite. Hence, we introduce a small error. This cut-off and the resulting error are rigorously defined in the IEEE 754 standard, which describes how floating point numbers are represented on modern computers. This allows us to prove rigorous theorems that take into account that computers only support finite precision arithmetic.
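We can see this cut-off directly. The following small sketch (plain Python, whose float type is a double precision number) prints \(1/3\) with more digits than are actually stored:

# Print 1/3 with 20 digits after the decimal point. Only the first
# 15-16 digits are meaningful; the rest reflect the cut-off error.
print(f"{1/3:.20f}")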

The two most commonly used types of floating point numbers are IEEE double precision and IEEE single precision. The former gives us around 15 digits of accuracy, the latter around 7 digits. For almost all purposes this is more than enough. Take for example the gravitational constant \(G\). It is known to around \(4\) digits of accuracy, only a fraction of the number of digits we have available on modern computers.
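As a rough illustration of these accuracies, here is a small sketch that evaluates the same fraction in both precisions (using the Numpy types introduced further below):

import numpy as np

# 1/3 in double and in single precision. The printed values agree with
# the exact 0.3333... to roughly 15 and 7 digits, respectively.
print(np.float64(1) / np.float64(3))
print(np.float32(1) / np.float32(3))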

For most numerical calculations double precision is preferred. The reason is that in some computations errors accumulate, leading to a loss of accuracy. Double precision gives us much more headroom for this than single precision.
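As a minimal (and deliberately naive) sketch of such accumulation, we can add the value \(0.1\) a hundred thousand times in both precisions and compare with the exact result \(10000\):

import numpy as np

# Repeatedly add 0.1 in single and in double precision.
# The exact result of the sum is 10000.
n = 100_000
single_sum = np.float32(0.0)
double_sum = np.float64(0.0)
for _ in range(n):
    single_sum += np.float32(0.1)
    double_sum += np.float64(0.1)
print(abs(single_sum - 10_000))  # noticeable accumulated error
print(abs(double_sum - 10_000))  # error is many orders of magnitude smaller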

However, not all applications need such precision. For example, many machine learning applications use the half precision type, which is even less accurate than single precision but can be evaluated extremely efficiently on dedicated hardware (e.g. tensor cores on modern machine learning accelerators).
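Numpy exposes half precision as np.float16. As a quick sketch of how little accuracy remains:

import numpy as np

# Half precision keeps only about 3 to 4 significant decimal digits.
print(np.float16(1) / np.float16(3))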

Floating point types in Python#

The Numpy module defines convenient ways to query properties of floating point numbers.

import numpy as np # import the numpy extension module
                   # and call it np as short form

The data type names in Numpy for floating point types are:

  • IEEE double precision: np.double, np.float64

  • IEEE single precision: np.single, np.float32

Let us query some properties of these numbers.

double_precision_info = np.finfo(np.float64)

The biggest and smallest (by absolute value) normalized floating point numbers are

double_precision_info.max
1.7976931348623157e+308
double_precision_info.tiny
2.2250738585072014e-308

Floating point numbers cannot be arbitrarily close to each other. There is a smallest relative distance, which we define shortly. It is given by

double_precision_info.eps
2.220446049250313e-16

This leads to a limited precision of floating point numbers. The approximate relative precision is

double_precision_info.precision
15

Task: What are the values for single precision arithmetic?#

single_precision_info = np.finfo(np.float32)
print([single_precision_info.max,
       single_precision_info.tiny,
       single_precision_info.eps,
       single_precision_info.precision])
[3.4028235e+38, 1.1754944e-38, 1.1920929e-07, 6]

Definition of floating point numbers#

We use the following model for floating point numbers. For more details, see e.g. the recommended textbook by Trefethen and Bau.

The set of floating point numbers is defined as follows:

\[ \mathcal{F} = \left\{(-1)^s\cdot b^e \cdot \frac{m}{b^{p-1}} \,:\, s = 0,1;\; e_{min}\leq e \leq e_{max};\; b^{p-1}\leq m\leq b^{p}-1\right\}. \]
  • IEEE double precision: \(e_{min} = -1022, e_{max} = 1023, p = 53\)

  • IEEE single precision: \(e_{min} = -126, e_{max} = 127, p = 24\)

Typically, \(b = 2\). For the mantissa we have:

\[ \frac{m}{b^{p-1}} = 1, 1+b^{1-p}, 1+2b^{1-p}, \dots, b-b^{1-p} \]
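To make the definition concrete, the following sketch enumerates all positive members of a toy floating point system with the made-up parameters \(b = 2\), \(p = 3\) and \(-1 \leq e \leq 1\) (not an IEEE format):

# All positive numbers of a toy system with b = 2, p = 3, -1 <= e <= 1.
b, p = 2, 3
toy_floats = sorted(
    b**e * m / b**(p - 1)
    for e in range(-1, 2)
    for m in range(b**(p - 1), b**p)
)
print(toy_floats)

Within each exponent range the numbers are equally spaced; the spacing doubles whenever the exponent increases by one.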

\(\rightarrow\) The distance between neighbouring floats is \(2^e b^{1-p}\).

\(\epsilon_{rel} = b^{1-p}\) is the smallest number such that \(1 + \epsilon_{rel} \neq 1\) in floating point arithmetic.

1 + double_precision_info.eps
1.0000000000000002
1 + .25 * double_precision_info.eps
1.0
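We can also check the neighbour spacing \(2^e b^{1-p}\) at other magnitudes. Numpy provides np.spacing, which returns the gap between a number and the next larger floating point number:

import numpy as np

print(np.spacing(1.0))    # equal to eps
print(np.spacing(2.0))    # twice as large
print(np.spacing(1e16))   # near 1e16 neighbouring doubles differ by 2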

Approximating numbers in floating point arithmetic#

Let \(x\in\mathbb{R}\) with \(b^{e_{min}}\leq x < b^{e_{max}+1}\). Define \(\epsilon_{mach}:=\frac{1}{2}b^{1-p}\).

There exists \(x'\in \mathcal{F}\) such that \(|x-x'|\leq\epsilon_{mach}|x|\).

\(\epsilon_{mach}\) bounds the relative distance from \(x\) to the nearest floating point number in \(\mathcal{F}\).

Define the projection

\[ fl: x \mapsto x', \]

where \(x'\) is the floating point number in \(\mathcal{F}\) closest to \(x\).

It follows that \(fl(x)=x\,(1+\epsilon)\) for some \(|\epsilon|\leq \epsilon_{mach}\).
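As a small sketch we can verify this bound numerically for a number that is not exactly representable, using Python's fractions module to obtain the exact values:

from fractions import Fraction

eps_mach = 2.0**-53               # machine epsilon for double precision
x_exact = Fraction(1, 10)         # the real number 1/10
fl_x = Fraction(0.1)              # exact rational value of the stored double 0.1
rel_error = abs(fl_x - x_exact) / x_exact
print(float(rel_error) <= eps_mach)   # True: |fl(x) - x| <= eps_mach |x|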

Fundamental Axiom of Floating Point Arithmetic#

Define \(x\odot y = fl(x \cdot y)\), where \(\cdot\) is one of \(+,-,\times,\div\). Then for all \(x,y\in\mathcal{F}\) there exists \(\epsilon\) with \(|\epsilon| \leq \epsilon_{mach}\) such that

\[ x\odot y = (x \cdot y)(1+\epsilon). \]

Most modern architectures guarantee this property.
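The axiom itself can be checked in the same way. Here is a small sketch that compares a floating point product with the exact product of the two stored operands:

from fractions import Fraction

eps_mach = 2.0**-53
x, y = 0.1, 0.3                    # both are members of F once stored as doubles
exact = Fraction(x) * Fraction(y)  # exact product of the stored values
computed = Fraction(x * y)         # the floating point product
rel_error = abs(computed - exact) / exact
print(float(rel_error) <= eps_mach)   # True, as guaranteed by the axiom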

Special symbols in floating point arithmetic#

In addition to numbers, several other important symbols are defined in the floating point standard. The most important ones are:

  • NaN: Not a number

  • \(\pm\) inf: positive and negative infinity

import numpy as np
a = np.inf
b = np.float64(0) / np.float64(0)
print(b)
nan
/tmp/ipykernel_7918/3058898986.py:3: RuntimeWarning: invalid value encountered in double_scalars
  b = np.float64(0) / np.float64(0)

Note that above we have explicitly converted the zeros in the division \(0/0\) to the corresponding Numpy type. The reason is that Python itself is not fully IEEE 754 compliant. According to the standard, the division \(0 / 0\) is valid and yields nan as its result. But let us consider what Python itself does.

import numpy as np
a = np.inf
b = 0. / 0.
print(b)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_7918/617932585.py in <module>
      1 import numpy as np
      2 a = np.inf
----> 3 b = 0. / 0.
      4 print(b)

ZeroDivisionError: float division by zero

Instead of returning nan, Python raises a ZeroDivisionError. This behaviour may be useful for certain applications, but it is not IEEE 754 compliant and is undesirable for numerics. This is one of many reasons why the Numpy module is so important for numerical computations in Python.
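The \(\pm\) inf symbol from the list above arises in the same way, for example when a nonzero number is divided by zero; again we use the Numpy types to obtain the IEEE 754 behaviour:

import numpy as np

# Dividing a nonzero number by zero gives inf (with a RuntimeWarning),
# and inf - inf is undefined, so it evaluates to nan.
c = np.float64(1) / np.float64(0)
print(c)        # inf
print(c - c)    # nan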