Floating-Point Numbers
Goals
Understand the structure of IEEE 754 floating-point numbers (sign, exponent, significand), the differences between normalized numbers, subnormals, and special values, and learn about machine epsilon and rounding modes.
Prerequisites
- Basics of binary numbers (bits, bytes)
- The concept of exponential notation
Contents
§1 Representing Numbers in Computers
Computers represent numbers using a finite number of bits. Floating-point numbers are the standard way to approximately represent real numbers.
Floating-Point Representation
A number $x$ is expressed in the form:
$$x = \pm m \times b^e$$

where:
- $m$: significand (mantissa)
- $b$: base (radix), usually 2
- $e$: exponent
Example: Floating-point in decimal
$12345 = 1.2345 \times 10^4$
$0.00678 = 6.78 \times 10^{-3}$
The decimal point "floats" to express numbers of vastly different magnitudes.
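The same decomposition can be seen in Python with the standard `math.frexp`, which returns a significand $m$ and exponent $e$ with $x = m \times 2^e$ (note Python's convention is $0.5 \leq |m| < 1$ rather than the "1.M" form used later):

```python
import math

# frexp splits x into (m, e) with x == m * 2**e and 0.5 <= |m| < 1.
m, e = math.frexp(6.5)
print(m, e)                       # 0.8125 3, since 6.5 = 0.8125 * 2**3
assert math.ldexp(m, e) == 6.5    # ldexp reassembles the number
```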
§2 The IEEE 754 Standard
IEEE 754 is the international standard for floating-point representation and arithmetic.
Main Formats
| Format | Bits | Sign | Exponent | Significand | Decimal digits |
|---|---|---|---|---|---|
| Single precision (float) | 32 | 1 | 8 | 23 | ~7 |
| Double precision (double) | 64 | 1 | 11 | 52 | ~16 |
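One way to observe the "~7 vs. ~16 decimal digits" difference is to round-trip the same value through both encodings using Python's `struct` module (a sketch; it assumes the platform uses IEEE 754 binary formats, as CPython does):

```python
import struct

x = 1 / 3
# Encode and decode x as a 32-bit single and a 64-bit double.
single = struct.unpack('<f', struct.pack('<f', x))[0]
double = struct.unpack('<d', struct.pack('<d', x))[0]
print(f"{single:.17f}")   # diverges from 1/3 after ~7 decimal digits
print(f"{double:.17f}")   # diverges from 1/3 after ~16 decimal digits
```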
§3 Normalized Numbers
A normalized number has its significand adjusted so that the leading bit is always 1.
Normalized Representation (double precision)
$$x = (-1)^S \times 1.M \times 2^{E-1023}$$

The leading "1." is implicit (the hidden bit) and is not stored. This saves one bit, effectively gaining an extra bit of precision.
The implicit "1." applies only when the exponent field E is 1 or greater (normalized numbers). When E = 0, the implicit bit switches to "0.", representing subnormal numbers and zero (§4).
Example: Representing $-6.5$
$-6.5 = -1.1010_2 \times 2^2$
- Sign S = 1 (negative)
- Exponent E = 2 + 1023 = 1025 = $10000000001_2$
- Significand M = $1010000\ldots0_2$ (the part after "1.")
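These three fields can be verified directly by reinterpreting the 64-bit pattern of $-6.5$ (a sketch using Python's `struct`; assumes IEEE 754 doubles):

```python
import struct

bits = struct.unpack('<Q', struct.pack('<d', -6.5))[0]
sign     = bits >> 63                 # 1 bit
exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
mantissa = bits & ((1 << 52) - 1)     # 52 bits, without the hidden "1."
print(sign, exponent, f"{mantissa:052b}")
# 1 1025 1010000...0  -- matching S = 1, E = 1025, M = 1010...
```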
Double-Precision Range
- Smallest normalized number: $\approx 2.2 \times 10^{-308}$
- Largest value: $\approx 1.8 \times 10^{308}$
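Python exposes these limits through `sys.float_info` (values shown are for IEEE 754 doubles):

```python
import sys

print(sys.float_info.min)   # 2.2250738585072014e-308, smallest normalized double
print(sys.float_info.max)   # 1.7976931348623157e+308, largest finite double
# Subnormals (§4) reach even smaller magnitudes, down to about 5e-324:
print(sys.float_info.min * sys.float_info.epsilon)
```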
§4 Special Values
IEEE 754 defines several special values based on the combination of the exponent E and the significand M. The implicit leading bit (hidden bit) changes depending on E.
| Exponent E | Significand M | Hidden bit | Interpretation | Value (double precision) |
|---|---|---|---|---|
| 0 | 0 | — | Zero | $\pm 0$ |
| 0 | $\neq 0$ | 0. | Subnormal | $(-1)^S \times 0.M \times 2^{-1022}$ |
| 1–2046 | any | 1. | Normalized | $(-1)^S \times 1.M \times 2^{E-1023}$ |
| 2047 | 0 | — | Infinity | $\pm\infty$ |
| 2047 | $\neq 0$ | — | NaN | undefined |
Infinity
Exponent all 1s, significand all 0s. $1/0 = +\infty$, $-1/0 = -\infty$.
NaN (Not a Number)
Exponent all 1s, significand nonzero. Results from $0/0$, $\infty - \infty$, etc.
Subnormal Numbers
Exponent all 0s, significand nonzero. Represent very small numbers close to zero.
Zero
Exponent and significand all 0s. Two variants: $+0$ and $-0$.
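These special values behave as follows in Python (a small sketch; note that `1/0` raises an exception in Python, so infinity is produced here by overflow instead):

```python
import math

inf = float('inf')
nan = float('nan')

assert 1e308 * 10 == inf                  # overflow yields +inf
assert math.isnan(inf - inf)              # inf - inf is NaN
assert nan != nan                         # NaN is unequal even to itself
assert 0.0 == -0.0                        # +0 and -0 compare equal...
assert math.copysign(1.0, -0.0) == -1.0   # ...but -0 keeps its sign bit
assert 0.0 < 5e-324 < 2.2250738585072014e-308   # smallest subnormal double
```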
§5 Machine Epsilon
Machine Epsilon $\varepsilon_{\mathrm{mach}}$
The gap between 1 and the next larger representable floating-point number; informally, the smallest positive $\varepsilon$ for which the computed result of $1 + \varepsilon$ differs from 1. (Under the default round-to-nearest mode, that informal threshold is actually about $\varepsilon_{\mathrm{mach}}/2$; the values below follow the gap definition.)
It characterizes the relative precision of floating-point numbers.
Values
- Single precision: $\varepsilon_{\mathrm{mach}} \approx 1.2 \times 10^{-7}$ ($2^{-23}$)
- Double precision: $\varepsilon_{\mathrm{mach}} \approx 2.2 \times 10^{-16}$ ($2^{-52}$)
Meaning of Machine Epsilon
When a real number $x$ within the normalized range is represented as a floating-point number $\mathrm{fl}(x)$:

$$\frac{|\mathrm{fl}(x) - x|}{|x|} \leq \frac{\varepsilon_{\mathrm{mach}}}{2}$$

That is, the relative error is at most $\varepsilon_{\mathrm{mach}}/2$.
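Both the gap and the error bound can be checked in Python (`sys.float_info.epsilon` is $2^{-52}$ for doubles; `Fraction` gives exact rational arithmetic for the comparison):

```python
import sys
from fractions import Fraction

eps = sys.float_info.epsilon
assert eps == 2**-52
assert 1.0 + eps > 1.0        # eps is the gap from 1 to the next double
assert 1.0 + eps / 2 == 1.0   # half the gap rounds back to 1 (ties to even)

# The relative error of storing 0.1 is below eps/2:
x = Fraction(1, 10)
rel_err = abs(Fraction(0.1) - x) / x   # Fraction(0.1) is the exact stored value
assert rel_err <= Fraction(eps) / 2
```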
§6 Rounding Modes
IEEE 754 defines five rounding modes:
- Round to nearest, ties to even (roundTiesToEven, default): round to the nearest representable value; break ties by choosing the value whose significand is even
- Round to nearest, ties away from zero (roundTiesToAway): round to the nearest; break ties by choosing the value farther from zero
- Round toward zero (roundTowardZero, truncation)
- Round toward $+\infty$ (roundTowardPositive, ceiling)
- Round toward $-\infty$ (roundTowardNegative, floor)
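CPython's binary floats always use the default ties-to-even mode, but Python's `decimal` module (IEEE 754 also defines decimal formats) exposes all five directions, which makes the differences easy to observe:

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

one  = Decimal("1")     # quantize target: round to integer
half = Decimal("2.5")   # exactly halfway between 2 and 3
neg  = Decimal("-2.7")

print(half.quantize(one, rounding=ROUND_HALF_EVEN))  # 2  (tie -> even)
print(half.quantize(one, rounding=ROUND_HALF_UP))    # 3  (tie -> away from zero)
print(neg.quantize(one, rounding=ROUND_DOWN))        # -2 (toward zero)
print(neg.quantize(one, rounding=ROUND_CEILING))     # -2 (toward +inf)
print(neg.quantize(one, rounding=ROUND_FLOOR))       # -3 (toward -inf)
```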
Quick Reference: When to Use Each Rounding Mode
| Mode | Good for | Not recommended for |
|---|---|---|
| TiesToEven | General scientific computing (default). Minimizes statistical bias | When directional guarantees are needed |
| TiesToAway | Finance/accounting (matches human "round half up" intuition) | Large-scale summation (slight upward bias) |
| TowardZero | Integer conversion (C's (int)x), sign-symmetric truncation | Precision-sensitive iterative computation |
| Toward $+\infty$ | Upper bound in interval arithmetic | Standalone computation (always biased upward) |
| Toward $-\infty$ | Lower bound in interval arithmetic | Standalone computation (always biased downward) |
In interval arithmetic, Toward $+\infty$ and Toward $-\infty$ are used together to compute an interval $[\text{lower}, \text{upper}]$ that is guaranteed to contain the exact result.
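A minimal sketch of this idea using the `decimal` module's directed rounding (the function name `bounded_div` is hypothetical, not a library API):

```python
from decimal import Decimal, localcontext, ROUND_CEILING, ROUND_FLOOR

def bounded_div(a, b, prec=10):
    """Return (lower, upper) enclosing the exact quotient a/b at `prec` digits."""
    with localcontext() as ctx:
        ctx.prec = prec
        ctx.rounding = ROUND_FLOOR      # toward -inf: safe lower bound
        lo = Decimal(a) / Decimal(b)
        ctx.rounding = ROUND_CEILING    # toward +inf: safe upper bound
        hi = Decimal(a) / Decimal(b)
    return lo, hi

lo, hi = bounded_div(1, 3)
print(lo, hi)   # 0.3333333333 0.3333333334 -- the exact 1/3 lies in between
```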
§7 Summary
- Floating-point numbers: approximate real numbers in the form $\pm m \times 2^e$
- IEEE 754: international standard. Single precision (32-bit), double precision (64-bit)
- Normalized numbers: implicit leading "1." (hidden bit); switches to "0." for subnormals
- Special values: $\pm\infty$, NaN, $\pm 0$, subnormals
- Machine epsilon: the gap between 1 and the next representable number; measures relative precision
- Single precision: ~7 decimal digits; double precision: ~16 decimal digits