IEEE 754 Floating-Point Standard

Goal

Understand the IEEE 754 floating-point representation (sign, exponent, significand), single/double precision bit layouts, special values ($\pm\infty$, NaN, $\pm 0$), normalized vs. denormalized numbers, and rounding rules.

Prerequisites

  • Binary number basics (bits, bytes)
  • Basic concept of floating-point numbers

1. Overview

IEEE 754 (formally, the IEEE Standard for Floating-Point Arithmetic) is the international standard for floating-point arithmetic. It was first published in 1985 (IEEE 754-1985), revised in 2008 (IEEE 754-2008), and most recently revised in 2019 (IEEE 754-2019).

Virtually all modern processors (CPUs, GPUs) implement this standard, although some offer faster modes that relax parts of it. In most programming languages, types such as float (single precision) and double (double precision) map to the corresponding IEEE 754 formats.

What IEEE 754 Specifies

  • Number representation format (sign, exponent, significand bit layout)
  • Special values ($\pm\infty$, NaN, $\pm 0$)
  • Rounding rules (5 rounding modes)
  • Correct rounding guarantee for basic operations ($+, -, \times, \div, \sqrt{\phantom{x}}$)
  • Exception handling (invalid operation, division by zero, overflow, underflow, inexact)

2. Number Representation

An IEEE 754 floating-point number consists of three fields: sign, exponent, and fraction. For normalized numbers, the encoded value is:

$$(-1)^s \times 2^{e-\text{bias}} \times (1 + f)$$
  • $s$ (sign): 0 for positive, 1 for negative (1 bit)
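  • $e$ (exponent): biased exponent field; the all-zeros value $e = 0$ encodes zero and denormalized numbers, and the all-ones value $e = e_{\max}$ encodes $\pm\infty$ and NaN
  • $f$ (fraction): the stored fraction field, $0 \le f < 1$; combined with the implicit leading 1, it forms the significand $1.f$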

[Figure: Double precision (64-bit) layout. Bit 63: sign S (1 bit). Bits 62-52: exponent E (11 bits, bias = 1023). Bits 51-0: significand F (52 bits; 53 bits with the implicit leading 1).]
Figure 1. Double precision bit layout: 1-bit sign, 11-bit exponent (bias 1023), 52-bit significand.
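
To make the layout concrete, the following minimal sketch splits a Python float (assumed to be an IEEE 754 double, which holds on all mainstream platforms) into its three fields and rebuilds the value from the formula above. The helper names `decode_double` and `rebuild_normal` are illustrative, not part of any library.

```python
import struct

def decode_double(x: float):
    """Split a double into (sign, biased exponent, fraction field)."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # bit 63
    exponent = (bits >> 52) & 0x7FF      # bits 62..52 (biased)
    fraction = bits & ((1 << 52) - 1)    # bits 51..0
    return sign, exponent, fraction

def rebuild_normal(sign: int, exponent: int, fraction: int, bias: int = 1023) -> float:
    """Evaluate (-1)^s * 2^(e - bias) * (1 + f); valid for normalized numbers only."""
    f = fraction / 2**52
    return (-1) ** sign * 2.0 ** (exponent - bias) * (1 + f)

s, e, f = decode_double(-6.5)      # -6.5 = -1.625 * 2^2
print(s, e, hex(f))                # sign, biased exponent, fraction field
print(rebuild_normal(s, e, f))     # reconstructs -6.5
```

The biased exponent 1025 corresponds to a true exponent of 1025 - 1023 = 2.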

3. Single and Double Precision

Parameter | Single (float32) | Double (float64)
Total bits | 32 | 64
Sign | 1 bit | 1 bit
Exponent | 8 bits | 11 bits
Significand | 23 bits (+1 = 24) | 52 bits (+1 = 53)
Bias | 127 | 1023
Decimal digits | ~7 | ~15-16
Machine epsilon | $2^{-23} \approx 1.19 \times 10^{-7}$ | $2^{-52} \approx 2.22 \times 10^{-16}$
Maximum value | $\approx 3.4 \times 10^{38}$ | $\approx 1.8 \times 10^{308}$
Min normalized | $\approx 1.2 \times 10^{-38}$ | $\approx 2.2 \times 10^{-308}$

IEEE 754-2008 also defines half precision (16-bit) and quadruple precision (128-bit).
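
The entries for bias, machine epsilon, maximum value, and minimum normalized value all follow from the two field widths. The sketch below derives them with exact rational arithmetic; `format_parameters` is a hypothetical helper written for this illustration, and the double precision results are compared with Python's `sys.float_info`.

```python
import sys
from fractions import Fraction

def format_parameters(exp_bits: int, frac_bits: int) -> dict:
    """Derive headline parameters of a binary format from its field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp_field = 2 ** exp_bits - 2        # largest non-reserved exponent field
    epsilon = Fraction(1, 2 ** frac_bits)    # gap between 1 and the next number
    largest = (2 - epsilon) * Fraction(2) ** (max_exp_field - bias)
    smallest_normal = Fraction(2) ** (1 - bias)
    return {
        "bias": bias,
        "epsilon": epsilon,
        "largest": largest,
        "smallest_normal": smallest_normal,
    }

for name, widths in {"single": (8, 23), "double": (11, 52)}.items():
    p = format_parameters(*widths)
    print(f"{name}: bias={p['bias']}, eps={float(p['epsilon']):.3g}, "
          f"max={float(p['largest']):.3g}, "
          f"min normal={float(p['smallest_normal']):.3g}")

# Cross-check the double precision values against the running interpreter.
double = format_parameters(11, 52)
print(float(double["largest"]) == sys.float_info.max)
print(float(double["smallest_normal"]) == sys.float_info.min)
```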

4. Special Values

Infinity ($\pm\infty$)

The exponent field is all 1s and the significand field is all 0s. Examples: $1/0 = +\infty$, $-1/0 = -\infty$, and, in double precision, $e^{710} = +\infty$ (overflow).
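
A small sketch of the infinity encoding in Python follows. One caveat: Python itself raises `ZeroDivisionError` for `1.0 / 0.0` and `OverflowError` for `math.exp(710)` rather than returning infinity, so the overflow here is produced by multiplication, which does follow the IEEE 754 default. The helper `bits_of` is illustrative.

```python
import math
import struct

def bits_of(x: float) -> str:
    """Return the 64-bit pattern of a double as 16 hex digits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"{bits:016x}"

pos_inf = float("inf")
print(bits_of(pos_inf))     # exponent field 0x7FF, fraction 0, sign 0
print(bits_of(-pos_inf))    # same, with the sign bit set

# Overflow in a multiplication silently produces +infinity.
overflowed = 1e308 * 10
print(overflowed == pos_inf)

# Python-specific behavior: the math module reports overflow as an error.
try:
    math.exp(710)
except OverflowError:
    print("math.exp(710) raises OverflowError in Python")
```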

NaN (Not a Number)

The exponent field is all 1s and the significand field is nonzero. NaN results from invalid operations such as $0/0$, $\infty - \infty$, and $\sqrt{-1}$.

Key property: every ordered or equality comparison involving NaN ($<$, $\le$, $>$, $\ge$, $==$) returns false, even when NaN is compared with itself; only $\neq$ returns true.

x = 0.0 / 0.0  // NaN
x == x         // false (can detect NaN this way)
x != x         // true

There are two kinds: quiet NaN (qNaN), which propagates through operations without raising an exception, and signaling NaN (sNaN), which signals the invalid operation exception when used as an operand.
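
The comparison behavior and the bit-level definition can be checked with a short Python sketch. It obtains NaN from `inf - inf`, because Python raises `ZeroDivisionError` for `0.0 / 0.0` and `ValueError` for `math.sqrt(-1)`. The helper `is_nan_bits` is illustrative; in practice `math.isnan` is the usual test.

```python
import math
import struct

def is_nan_bits(x: float) -> bool:
    """NaN test from the encoding: exponent all 1s and fraction nonzero."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and fraction != 0

nan = math.inf - math.inf    # an invalid operation yields a quiet NaN

comparisons = {
    "==": nan == nan,
    "<": nan < nan,
    "<=": nan <= nan,
    ">": nan > nan,
    ">=": nan >= nan,
    "!=": nan != nan,
}
for op, result in comparisons.items():
    print(f"nan {op} nan -> {result}")    # only != is True

print(is_nan_bits(nan), is_nan_bits(math.inf))
```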

Signed Zero ($\pm 0$)

$+0$ and $-0$ compare as equal ($+0 == -0$ is true), but $1/(+0) = +\infty$ while $1/(-0) = -\infty$.
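
Because Python raises `ZeroDivisionError` on division by a floating-point zero, the sketch below observes the sign of zero through `math.copysign` and the raw bit pattern instead of through $1/x$. It also shows one way a negative zero can arise: a negative product that underflows. The helper `bits_of` is illustrative.

```python
import math
import struct

def bits_of(x: float) -> str:
    """Return the 64-bit pattern of a double as 16 hex digits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"{bits:016x}"

pos_zero, neg_zero = 0.0, -0.0

print(pos_zero == neg_zero)                 # the two zeros compare equal
print(bits_of(pos_zero), bits_of(neg_zero)) # they differ only in the sign bit
print(math.copysign(1.0, neg_zero))         # copysign exposes the sign of zero

# A tiny negative product underflows to -0.0, keeping the sign of the exact result.
underflowed = -1e-200 * 1e-200
print(underflowed, math.copysign(1.0, underflowed))
```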

5. Normalized and Denormalized Numbers

Normalized Numbers

A number is normalized when its exponent field $e$ satisfies $1 \le e \le e_{\max} - 1$. The implicit leading 1 provides the full $p$ bits of precision (53 bits for double).

Denormalized Numbers (Subnormal Numbers)

A number is denormalized (subnormal) when its exponent field is $e = 0$ and its fraction field is nonzero. The implicit leading bit is 0, and the value is interpreted as:

$$(-1)^s \times 2^{1-\text{bias}} \times (0 + f)$$

Denormalized numbers fill the gap between zero and the smallest normalized number, enabling gradual underflow. They carry fewer significant bits than normalized numbers, but they guarantee the important property that, for finite $x$ and $y$, $x - y = 0 \Rightarrow x = y$.
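
Gradual underflow can be observed directly in double precision. In this sketch, two distinct numbers near the bottom of the normalized range have a difference that is too small to be normalized, yet the result is a nonzero denormalized number rather than zero. The variable names are illustrative.

```python
import math
import sys

smallest_normal = sys.float_info.min          # 2^-1022
smallest_subnormal = math.ldexp(1.0, -1074)   # 2^(1 - 1023) * 2^-52

x = 1.5 * smallest_normal
y = smallest_normal
diff = x - y                                  # exactly 2^-1023, a subnormal

print(smallest_normal, smallest_subnormal)
print(x != y, diff != 0.0)                    # distinct inputs, nonzero difference
print(0.0 < diff < smallest_normal)           # the difference lies in the subnormal range

# Below the smallest subnormal, results finally round to zero.
print(smallest_subnormal / 2)
```

Without denormalized numbers, `diff` would be flushed to zero even though `x != y`.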

6. Rounding Rules

IEEE 754 guarantees that basic operations ($+, -, \times, \div, \sqrt{\phantom{x}}$) are correctly rounded: the result is identical to performing the operation with infinite precision and then rounding. The default mode is round to nearest, ties to even.
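
Both claims can be spot-checked in Python. The sketch below computes an exact sum with `fractions.Fraction`, rounds it once to a double, and compares it with the hardware result; it then uses integer-to-float conversion (which in CPython rounds to nearest, ties to even) to show the tie-breaking rule at $2^{53}$, where adjacent doubles are 2 apart. The helper `correctly_rounded_sum` is illustrative.

```python
from fractions import Fraction

def correctly_rounded_sum(a: float, b: float) -> float:
    """Add two doubles exactly, then round the exact result once."""
    return float(Fraction(a) + Fraction(b))

# The hardware sum matches "infinite precision, then round".
print(0.1 + 0.2 == correctly_rounded_sum(0.1, 0.2))
print(0.1 + 0.2)    # not 0.3, because 0.1 and 0.2 are already rounded inputs

# Ties to even: 2^53 + 1 lies exactly halfway between 2^53 and 2^53 + 2.
tie_low = float(2**53 + 1)     # rounds to 2^53 (even significand)
tie_high = float(2**53 + 3)    # halfway between 2^53 + 2 and 2^53 + 4
print(int(tie_low) - 2**53, int(tie_high) - 2**53)
```

The tie at $2^{53} + 1$ rounds down while the tie at $2^{53} + 3$ rounds up; in both cases the chosen neighbor is the one whose last significand bit is 0.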

7. Relationship with Machine Epsilon

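Machine epsilon $\varepsilon_{\text{mach}}$ is defined here as the gap between 1 and the next larger floating-point number. (The common description "the smallest $\varepsilon$ such that $1 + \varepsilon > 1$" is only approximate: under round to nearest, values slightly above $\varepsilon_{\text{mach}}/2$ already cause $1 + \varepsilon$ to round up.) In IEEE 754, with $p$ significand bits: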

$$\varepsilon_{\text{mach}} = 2^{-(p-1)}$$

($p = 53$ gives $\varepsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$).

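For each basic operation $\circ \in \{+, -, \times, \div\}$, the computed result satisfies $\text{fl}(a \circ b) = (a \circ b)(1 + \delta)$ with $|\delta| \le \varepsilon_{\text{mach}}/2$ under the default round-to-nearest mode (and $|\delta| < \varepsilon_{\text{mach}}$ under the directed rounding modes), provided the result neither overflows nor underflows. This property is the foundation of numerical stability analysis.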
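
As a rough check, the sketch below reads machine epsilon from `sys.float_info` and measures the relative error $|\delta|$ of one multiplication exactly with `fractions.Fraction`. It compares $|\delta|$ against $\varepsilon_{\text{mach}}/2 = 2^{-53}$, the tighter bound that applies under round to nearest. The helper `relative_error` and the sample operands are illustrative.

```python
import sys
from fractions import Fraction

eps = sys.float_info.epsilon          # 2^-52 for double precision
print(eps == 2.0 ** -52)
print(1.0 + eps > 1.0)                # eps is the gap to the next double after 1
print(1.0 + eps / 2 == 1.0)           # an exact tie rounds back to 1 (ties to even)

def relative_error(a: float, b: float) -> Fraction:
    """Exact |delta| in fl(a * b) = (a * b)(1 + delta)."""
    exact = Fraction(a) * Fraction(b)
    computed = Fraction(a * b)
    return abs(computed - exact) / abs(exact)

unit_roundoff = Fraction(1, 2**53)    # eps / 2
delta = relative_error(0.1, 0.3)
print(float(delta), delta <= unit_roundoff)
```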

8. Frequently Asked Questions

Q1. What is IEEE 754?

IEEE 754 is the international standard for floating-point arithmetic. It specifies number representation, rounding rules, special values, and exception handling. It was first published in 1985, and virtually all modern processors implement it.

Q2. How many significant digits does double precision have?

The 52-bit significand field (53 bits with the implicit leading 1) provides about 15 to 16 significant decimal digits. Machine epsilon is $2^{-52} \approx 2.22 \times 10^{-16}$.

Q3. What is NaN?

A special value representing the result of invalid operations such as $0/0$, $\infty - \infty$, or $\sqrt{-1}$. Every comparison except $\neq$ returns false when an operand is NaN, so $\text{NaN} == \text{NaN}$ is false and $\text{NaN} \neq \text{NaN}$ is true.
