Floating-Point Numbers

Goals

Understand the structure of IEEE 754 floating-point numbers (sign, exponent, significand), the differences between normalized numbers, subnormals, and special values, and learn about machine epsilon and rounding modes.

Prerequisites

  • Basics of binary numbers (bits, bytes)
  • The concept of exponential notation

§1 Representing Numbers in Computers

Computers represent numbers using a finite number of bits. Floating-point numbers are the standard way to approximately represent real numbers.

Floating-Point Representation

A number $x$ is expressed in the form:

$$x = \pm m \times b^e$$

where:

  • $m$: significand (mantissa)
  • $b$: base (radix), usually 2
  • $e$: exponent

Example: Floating-point in decimal

$12345 = 1.2345 \times 10^4$

$0.00678 = 6.78 \times 10^{-3}$

The decimal point "floats" to express numbers of vastly different magnitudes.
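The same decomposition works in base 2. As a small sketch, Python's `math.frexp` splits a float into a significand and a power-of-two exponent:

```python
import math

# Decompose a float as x = m * 2**e, with 0.5 <= |m| < 1
# (frexp's normalization convention).
m, e = math.frexp(12345.0)
assert m * 2**e == 12345.0   # the decomposition is exact
print(m, e)
```

Here the "floating" point shows up as the exponent `e`: the same significand machinery represents both huge and tiny magnitudes.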

§2 The IEEE 754 Standard

IEEE 754 is the international standard for floating-point representation and arithmetic.

Main Formats

| Format | Bits | Sign | Exponent | Significand | Decimal digits |
|---|---|---|---|---|---|
| Single precision (float) | 32 | 1 | 8 | 23 | ~7 |
| Double precision (double) | 64 | 1 | 11 | 52 | ~16 |
Figure 1. Single-precision (float, 32-bit) bit layout: sign S (1 bit), exponent E (8 bits), significand M (23 bits). Bias = 127; normalized value = $(-1)^S \times 1.M \times 2^{E-127}$. The formula applies to normalized numbers (§3).
Figure 2. Double-precision (double, 64-bit) bit layout: sign S (1 bit), exponent E (11 bits), significand M (52 bits). Bias = 1023; normalized value = $(-1)^S \times 1.M \times 2^{E-1023}$. The wider significand provides ~16 decimal digits of precision. The formula applies to normalized numbers (§3).
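These bit fields can be inspected directly. As a quick sketch, Python's `struct` module with the standard `'>f'` format packs a float into its IEEE 754 single-precision bytes, from which the three fields can be masked out:

```python
import struct

# Pack 1.0 into its raw IEEE 754 single-precision bytes (big-endian)
# and extract the three fields.
bits = int.from_bytes(struct.pack('>f', 1.0), 'big')
sign = bits >> 31
exponent = (bits >> 23) & 0xFF    # 8-bit exponent field
significand = bits & 0x7FFFFF     # 23-bit fraction field

# 1.0 = (-1)^0 x 1.0 x 2^(127-127): sign 0, exponent field 127, fraction 0.
print(sign, exponent, significand)  # → 0 127 0
```

The same decomposition works for doubles with `'>d'`, a 63-bit shift, an 11-bit exponent mask, and a 52-bit fraction mask.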

§3 Normalized Numbers

A normalized number has its significand adjusted so that the leading bit is always 1.

Normalized Representation (double precision)

$$x = (-1)^S \times 1.M \times 2^{E-1023}$$

The leading "1." is implicit (the hidden bit) and is not stored. This saves one bit, effectively gaining an extra bit of precision.

The implicit "1." applies only when the exponent field E is 1 or greater (normalized numbers). When E = 0, the implicit bit switches to "0.", representing subnormal numbers and zero (§4).

Example: Representing $-6.5$

$-6.5 = -1.1010_2 \times 2^2$

  • Sign S = 1 (negative)
  • Exponent E = 2 + 1023 = 1025 = $10000000001_2$
  • Significand M = $1010000\ldots0_2$ (the part after "1.")
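This encoding can be verified by inspecting the raw bytes of $-6.5$ (a sketch using Python's `struct` module, whose `'>d'` format is IEEE 754 double precision by specification):

```python
import struct

# -6.5 as a double: sign 1, biased exponent 1025, fraction bits "101" + 49 zeros.
raw = struct.pack('>d', -6.5).hex()
assert raw == 'c01a000000000000'

bits = int.from_bytes(struct.pack('>d', -6.5), 'big')
assert bits >> 63 == 1                         # sign S = 1
assert (bits >> 52) & 0x7FF == 1025            # biased exponent E = 2 + 1023
assert bits & ((1 << 52) - 1) == 0b101 << 49   # significand M = 1010...0
```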

Double-Precision Range

  • Smallest normalized number: $\approx 2.2 \times 10^{-308}$
  • Largest value: $\approx 1.8 \times 10^{308}$
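Both limits are exposed in Python via `sys.float_info`, which can serve as a quick check of the figures above:

```python
import sys

# Extremes of the normalized double-precision range.
print(sys.float_info.min)   # smallest normalized positive double, ~2.2e-308
print(sys.float_info.max)   # largest finite double, ~1.8e308

# Going past the largest finite value overflows to infinity.
assert sys.float_info.max * 2 == float('inf')
```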

§4 Special Values

IEEE 754 defines several special values based on the combination of the exponent E and the significand M. The implicit leading bit (hidden bit) changes depending on E.

| Exponent E | Significand M | Hidden bit | Interpretation | Value (double precision) |
|---|---|---|---|---|
| 0 | 0 | 0. | Zero | $\pm 0$ |
| 0 | $\neq 0$ | 0. | Subnormal | $(-1)^S \times 0.M \times 2^{-1022}$ |
| 1–2046 | any | 1. | Normalized | $(-1)^S \times 1.M \times 2^{E-1023}$ |
| 2047 | 0 | — | Infinity | $\pm\infty$ |
| 2047 | $\neq 0$ | — | NaN | undefined |

Infinity

Exponent all 1s, significand all 0s. $1/0 = +\infty$, $-1/0 = -\infty$.

NaN (Not a Number)

Exponent all 1s, significand nonzero. Results from $0/0$, $\infty - \infty$, etc.

Subnormal Numbers

Exponent all 0s, significand nonzero. Represent very small numbers close to zero.

Zero

Exponent and significand all 0s. Two variants: $+0$ and $-0$.
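These special values can all be observed from Python. (One caveat: Python raises an exception for division by zero instead of returning $\pm\infty$, but the values themselves behave as IEEE 754 specifies.) A short sketch:

```python
import math

inf = float('inf')
nan = float('nan')

# Infinity absorbs finite operands; inf - inf is indeterminate -> NaN.
assert inf - 1 == inf
assert math.isnan(inf - inf)

# NaN compares unequal even to itself.
assert nan != nan

# Signed zero: +0.0 and -0.0 compare equal but carry different signs.
assert 0.0 == -0.0
assert math.copysign(1.0, -0.0) == -1.0

# Subnormals: the smallest positive double is 2**-1074.
tiny = 2.0 ** -1074
assert tiny > 0.0
assert tiny / 2 == 0.0   # halfway to zero; ties-to-even rounds to 0.0
```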

§5 Machine Epsilon

Machine Epsilon $\varepsilon_{\mathrm{mach}}$

The gap between 1 and the next larger representable floating-point number. Informally, it is the smallest positive floating-point number $\varepsilon$ for which the computed result of $1 + \varepsilon$ differs from 1.

It quantifies the "relative precision" of floating-point numbers.

Values

  • Single precision: $\varepsilon_{\mathrm{mach}} \approx 1.2 \times 10^{-7}$ ($2^{-23}$)
  • Double precision: $\varepsilon_{\mathrm{mach}} \approx 2.2 \times 10^{-16}$ ($2^{-52}$)
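The double-precision value is available as `sys.float_info.epsilon`, and its defining property is easy to check (a sketch under the default round-to-nearest mode):

```python
import sys

# Machine epsilon for doubles: the gap between 1.0 and the next float.
eps = sys.float_info.epsilon
assert eps == 2.0 ** -52

assert 1.0 + eps != 1.0       # adding eps is visible...
assert 1.0 + eps / 2 == 1.0   # ...but eps/2 is absorbed (tie rounds to even)
```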

Meaning of Machine Epsilon

When a real number $x$ within the normalized range is rounded to the nearest floating-point number $\mathrm{fl}(x)$:

$$\frac{|\mathrm{fl}(x) - x|}{|x|} \leq \frac{\varepsilon_{\mathrm{mach}}}{2}$$

That is, the relative error is at most $\varepsilon_{\mathrm{mach}}/2$.
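This bound can be verified exactly for a concrete case. The decimal value $0.1$ is not representable in binary, so $\mathrm{fl}(0.1) \neq 1/10$; Python's `fractions` module recovers the exact stored value and lets us compute the relative error with no further rounding (a sketch):

```python
from fractions import Fraction

# Fraction(0.1) converts the *stored* double to its exact rational value.
exact = Fraction(1, 10)
stored = Fraction(0.1)
rel_err = abs(stored - exact) / exact

eps = Fraction(2) ** -52          # machine epsilon as an exact rational
assert 0 < rel_err <= eps / 2     # within the eps/2 bound
```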

§6 Rounding Modes

IEEE 754 defines five rounding modes:

  • Round to nearest, ties to even (roundTiesToEven, default): round to the nearest representable value; break ties by choosing the value whose significand is even
  • Round to nearest, ties away from zero (roundTiesToAway): round to the nearest; break ties by choosing the value farther from zero
  • Round toward zero (roundTowardZero, truncation)
  • Round toward $+\infty$ (roundTowardPositive, ceiling)
  • Round toward $-\infty$ (roundTowardNegative, floor)
Figure 3. Five rounding modes for a positive number $x$ between adjacent floating-point numbers $a$ (even significand) and $b$ (odd significand); cases x₁ (near a) and x₂ (near b) are both shown. Modes 1 (TiesToEven, default) and 2 (TiesToAway) always round to nearest and differ only at the midpoint, where TiesToEven chooses $a$ (even significand) and TiesToAway chooses $b$ (away from zero). Modes 3 (TowardZero) and 5 (Toward $-\infty$) always round to $a$; mode 4 (Toward $+\infty$) always rounds to $b$.
Figure 4. Five rounding modes for a negative number $x$ between adjacent floating-point numbers $a$ (odd significand) and $b$ (even significand). Compared to the positive case, modes 3 (TowardZero) and 4 (Toward $+\infty$) now both round to $b$ (closer to zero), while mode 5 (Toward $-\infty$) rounds to $a$ (larger absolute value). At the midpoint, TiesToEven chooses $b$ (even significand) and TiesToAway chooses $a$ (away from zero).
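Pure Python offers no portable way to change the binary floating-point rounding mode, but the `decimal` module exposes equivalents of all five IEEE 754 modes, which makes the midpoint behavior easy to demonstrate (a sketch rounding to integers):

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

x = Decimal('2.5')   # exactly at the midpoint between 2 and 3
q = Decimal('1')     # quantize to integer

assert x.quantize(q, rounding=ROUND_HALF_EVEN) == 2   # ties to even
assert x.quantize(q, rounding=ROUND_HALF_UP)   == 3   # ties away from zero
assert x.quantize(q, rounding=ROUND_DOWN)      == 2   # toward zero
assert x.quantize(q, rounding=ROUND_CEILING)   == 3   # toward +inf
assert x.quantize(q, rounding=ROUND_FLOOR)     == 2   # toward -inf

y = Decimal('-2.5')  # the negative midpoint flips the directed modes
assert y.quantize(q, rounding=ROUND_HALF_EVEN) == -2  # even stays nearest
assert y.quantize(q, rounding=ROUND_DOWN)      == -2  # toward zero
assert y.quantize(q, rounding=ROUND_FLOOR)     == -3  # toward -inf
```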

Quick Reference: When to Use Each Rounding Mode

| Mode | Good for | Not recommended for |
|---|---|---|
| TiesToEven | General scientific computing (default); minimizes statistical bias | When directional guarantees are needed |
| TiesToAway | Finance/accounting (matches human "round half up" intuition) | Large-scale summation (slight upward bias) |
| TowardZero | Integer conversion (C's `(int)x`), sign-symmetric truncation | Precision-sensitive iterative computation |
| Toward $+\infty$ | Upper bound in interval arithmetic | Standalone computation (always biased upward) |
| Toward $-\infty$ | Lower bound in interval arithmetic | Standalone computation (always biased downward) |

In interval arithmetic, Toward $+\infty$ and Toward $-\infty$ are used together to compute an interval $[\text{lower}, \text{upper}]$ that is guaranteed to contain the exact result.

§7 Summary

  • Floating-point numbers: approximate real numbers in the form $\pm m \times 2^e$
  • IEEE 754: international standard. Single precision (32-bit), double precision (64-bit)
  • Normalized numbers: implicit leading "1." (hidden bit); switches to "0." for subnormals
  • Special values: $\pm\infty$, NaN, $\pm 0$, subnormals
  • Machine epsilon: the gap between 1 and the next representable number (informally, the smallest $\varepsilon$ with $1 + \varepsilon \neq 1$)
  • Single precision: ~7 decimal digits; double precision: ~16 decimal digits