Floating-Point Numbers

Goals

Understand the structure of IEEE 754 floating-point numbers (sign, exponent, significand), the differences between normalized numbers, subnormals, and special values, and learn about machine epsilon and rounding modes.

Prerequisites

  • Basics of binary numbers (bits, bytes)
  • The concept of exponential notation

§1 Representing Numbers in Computers

Computers represent numbers using a finite number of bits. Floating-point numbers are the standard way to approximately represent real numbers.

Floating-Point Representation

A number $x$ is expressed in the form:

$$x = \pm m \times b^e$$

where:

  • $m$: significand (mantissa)
  • $b$: base (radix), usually 2
  • $e$: exponent

Example: Floating-point in decimal

$12345 = 1.2345 \times 10^4$

$0.00678 = 6.78 \times 10^{-3}$

The decimal point "floats" to express numbers of vastly different magnitudes.
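The same decomposition works in base 2. As a small sketch, Python's `math.frexp` splits a float into a significand and a power-of-two exponent:

```python
import math

# Decompose a float as x = m * 2**e, with 0.5 <= |m| < 1
# (frexp's normalization convention).
m, e = math.frexp(12345.0)
assert m * 2**e == 12345.0   # the decomposition is exact
print(m, e)
```

Here the "floating" point shows up as the exponent `e`: the same significand machinery represents both huge and tiny magnitudes.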

§2 The IEEE 754 Standard

IEEE 754 is the international standard for floating-point representation and arithmetic.

Main Formats

| Format | Bits | Sign | Exponent | Significand | Decimal digits |
|---|---|---|---|---|---|
| Single precision (float) | 32 | 1 | 8 | 23 | ~7 |
| Double precision (double) | 64 | 1 | 11 | 52 | ~16 |
Figure 1. Single-precision (float, 32-bit) bit layout: sign S (1 bit), exponent E (8 bits), significand M (23 bits). Bias = 127; normalized value = $(-1)^S \times 1.M \times 2^{E-127}$. The formula applies to normalized numbers (§3).
Figure 2. Double-precision (double, 64-bit) bit layout: sign S (1 bit), exponent E (11 bits), significand M (52 bits). Bias = 1023; normalized value = $(-1)^S \times 1.M \times 2^{E-1023}$. The wider significand provides ~16 decimal digits of precision. The formula applies to normalized numbers (§3).
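These bit fields can be inspected directly. As a quick sketch, Python's `struct` module with the standard `'>f'` format packs a float into its IEEE 754 single-precision bytes, from which the three fields can be masked out:

```python
import struct

# Pack 1.0 into its raw IEEE 754 single-precision bytes (big-endian)
# and extract the three fields.
bits = int.from_bytes(struct.pack('>f', 1.0), 'big')
sign = bits >> 31
exponent = (bits >> 23) & 0xFF    # 8-bit exponent field
significand = bits & 0x7FFFFF     # 23-bit fraction field

# 1.0 = (-1)^0 x 1.0 x 2^(127-127): sign 0, exponent field 127, fraction 0.
print(sign, exponent, significand)  # → 0 127 0
```

The same decomposition works for doubles with `'>d'`, a 63-bit shift, an 11-bit exponent mask, and a 52-bit fraction mask.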

§3 Normalized Numbers

A normalized number has its significand adjusted so that the leading bit is always 1.

Normalized Representation (double precision)

$$x = (-1)^S \times 1.M \times 2^{E-1023}$$

The leading "1." is implicit (the hidden bit) and is not stored. This saves one bit, effectively gaining an extra bit of precision.

The implicit "1." applies only when the exponent field E is 1 or greater (normalized numbers). When E = 0, the implicit bit switches to "0.", representing subnormal numbers and zero (§4).

Example: Representing $-6.5$

$-6.5 = -1.1010_2 \times 2^2$

  • Sign S = 1 (negative)
  • Exponent E = 2 + 1023 = 1025 = $10000000001_2$
  • Significand M = $1010000\ldots0_2$ (the part after "1.")
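This encoding can be verified by inspecting the raw bytes of $-6.5$ (a sketch using Python's `struct` module, whose `'>d'` format is IEEE 754 double precision by specification):

```python
import struct

# -6.5 as a double: sign 1, biased exponent 1025, fraction bits "101" + 49 zeros.
raw = struct.pack('>d', -6.5).hex()
assert raw == 'c01a000000000000'

bits = int.from_bytes(struct.pack('>d', -6.5), 'big')
assert bits >> 63 == 1                         # sign S = 1
assert (bits >> 52) & 0x7FF == 1025            # biased exponent E = 2 + 1023
assert bits & ((1 << 52) - 1) == 0b101 << 49   # significand M = 1010...0
```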

Double-Precision Range

  • Smallest normalized number: $\approx 2.2 \times 10^{-308}$
  • Largest value: $\approx 1.8 \times 10^{308}$
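Both limits are exposed in Python via `sys.float_info`, which can serve as a quick check of the figures above:

```python
import sys

# Extremes of the normalized double-precision range.
print(sys.float_info.min)   # smallest normalized positive double, ~2.2e-308
print(sys.float_info.max)   # largest finite double, ~1.8e308

# Going past the largest finite value overflows to infinity.
assert sys.float_info.max * 2 == float('inf')
```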

§4 Special Values

IEEE 754 defines several special values based on the combination of the exponent E and the significand M. The implicit leading bit (hidden bit) changes depending on E.

| Exponent E | Significand M | Hidden bit | Interpretation | Value (double precision) |
|---|---|---|---|---|
| 0 | 0 | 0. | Zero | $\pm 0$ |
| 0 | $\neq 0$ | 0. | Subnormal | $(-1)^S \times 0.M \times 2^{-1022}$ |
| 1–2046 | any | 1. | Normalized | $(-1)^S \times 1.M \times 2^{E-1023}$ |
| 2047 | 0 | — | Infinity | $\pm\infty$ |
| 2047 | $\neq 0$ | — | NaN | undefined |

Infinity

Exponent all 1s, significand all 0s. $1/0 = +\infty$, $-1/0 = -\infty$.

NaN (Not a Number)

Exponent all 1s, significand nonzero. Results from $0/0$, $\infty - \infty$, etc.

Subnormal Numbers

Exponent all 0s, significand nonzero. Represent very small numbers close to zero.

Zero

Exponent and significand all 0s. Two variants: $+0$ and $-0$.
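These special values can all be observed from Python. (One caveat: Python raises an exception for division by zero instead of returning $\pm\infty$, but the values themselves behave as IEEE 754 specifies.) A short sketch:

```python
import math

inf = float('inf')
nan = float('nan')

# Infinity absorbs finite operands; inf - inf is indeterminate -> NaN.
assert inf - 1 == inf
assert math.isnan(inf - inf)

# NaN compares unequal even to itself.
assert nan != nan

# Signed zero: +0.0 and -0.0 compare equal but carry different signs.
assert 0.0 == -0.0
assert math.copysign(1.0, -0.0) == -1.0

# Subnormals: the smallest positive double is 2**-1074.
tiny = 2.0 ** -1074
assert tiny > 0.0
assert tiny / 2 == 0.0   # halfway to zero; ties-to-even rounds to 0.0
```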

§5 Machine Epsilon

Machine Epsilon $\varepsilon_{\mathrm{mach}}$

The gap between 1 and the next larger representable floating-point number. Informally, it is the smallest positive floating-point number $\varepsilon$ for which the computed result of $1 + \varepsilon$ differs from 1.

It quantifies the "relative precision" of floating-point numbers.

Values

  • Single precision: $\varepsilon_{\mathrm{mach}} \approx 1.2 \times 10^{-7}$ ($2^{-23}$)
  • Double precision: $\varepsilon_{\mathrm{mach}} \approx 2.2 \times 10^{-16}$ ($2^{-52}$)
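The double-precision value is available as `sys.float_info.epsilon`, and its defining property is easy to check (a sketch under the default round-to-nearest mode):

```python
import sys

# Machine epsilon for doubles: the gap between 1.0 and the next float.
eps = sys.float_info.epsilon
assert eps == 2.0 ** -52

assert 1.0 + eps != 1.0       # adding eps is visible...
assert 1.0 + eps / 2 == 1.0   # ...but eps/2 is absorbed (tie rounds to even)
```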

Meaning of Machine Epsilon

When a real number $x$ within the normalized range is rounded to the nearest floating-point number $\mathrm{fl}(x)$:

$$\frac{|\mathrm{fl}(x) - x|}{|x|} \leq \frac{\varepsilon_{\mathrm{mach}}}{2}$$

That is, the relative error is at most $\varepsilon_{\mathrm{mach}}/2$.
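This bound can be verified exactly for a concrete case. The decimal value $0.1$ is not representable in binary, so $\mathrm{fl}(0.1) \neq 1/10$; Python's `fractions` module recovers the exact stored value and lets us compute the relative error with no further rounding (a sketch):

```python
from fractions import Fraction

# Fraction(0.1) converts the *stored* double to its exact rational value.
exact = Fraction(1, 10)
stored = Fraction(0.1)
rel_err = abs(stored - exact) / exact

eps = Fraction(2) ** -52          # machine epsilon as an exact rational
assert 0 < rel_err <= eps / 2     # within the eps/2 bound
```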

§6 Rounding Modes

IEEE 754 defines five rounding modes:

  • Round to nearest, ties to even (roundTiesToEven, default): round to the nearest representable value; break ties by choosing the value whose significand is even
  • Round to nearest, ties away from zero (roundTiesToAway): round to the nearest; break ties by choosing the value farther from zero
  • Round toward zero (roundTowardZero, truncation)
  • Round toward $+\infty$ (roundTowardPositive, ceiling)
  • Round toward $-\infty$ (roundTowardNegative, floor)
Figure 3. Five rounding modes for a positive number $x$ between adjacent floating-point numbers $a$ (even significand) and $b$ (odd significand); cases x₁ (near a) and x₂ (near b) are both shown. Modes 1 (TiesToEven, default) and 2 (TiesToAway) always round to nearest and differ only at the midpoint, where TiesToEven chooses $a$ (even significand) and TiesToAway chooses $b$ (away from zero). Modes 3 (TowardZero) and 5 (Toward $-\infty$) always round to $a$; mode 4 (Toward $+\infty$) always rounds to $b$.
Figure 4. Five rounding modes for a negative number $x$ between adjacent floating-point numbers $a$ (odd significand) and $b$ (even significand). Compared to the positive case, modes 3 (TowardZero) and 4 (Toward $+\infty$) now both round to $b$ (closer to zero), while mode 5 (Toward $-\infty$) rounds to $a$ (larger absolute value). At the midpoint, TiesToEven chooses $b$ (even significand) and TiesToAway chooses $a$ (away from zero).
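Pure Python offers no portable way to change the binary floating-point rounding mode, but the `decimal` module exposes equivalents of all five IEEE 754 modes, which makes the midpoint behavior easy to demonstrate (a sketch rounding to integers):

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

x = Decimal('2.5')   # exactly at the midpoint between 2 and 3
q = Decimal('1')     # quantize to integer

assert x.quantize(q, rounding=ROUND_HALF_EVEN) == 2   # ties to even
assert x.quantize(q, rounding=ROUND_HALF_UP)   == 3   # ties away from zero
assert x.quantize(q, rounding=ROUND_DOWN)      == 2   # toward zero
assert x.quantize(q, rounding=ROUND_CEILING)   == 3   # toward +inf
assert x.quantize(q, rounding=ROUND_FLOOR)     == 2   # toward -inf

y = Decimal('-2.5')  # the negative midpoint flips the directed modes
assert y.quantize(q, rounding=ROUND_HALF_EVEN) == -2  # even stays nearest
assert y.quantize(q, rounding=ROUND_DOWN)      == -2  # toward zero
assert y.quantize(q, rounding=ROUND_FLOOR)     == -3  # toward -inf
```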

Quick Reference: When to Use Each Rounding Mode

| Mode | Good for | Not recommended for |
|---|---|---|
| TiesToEven | General scientific computing (default); minimizes statistical bias | When directional guarantees are needed |
| TiesToAway | Finance/accounting (matches human "round half up" intuition) | Large-scale summation (slight upward bias) |
| TowardZero | Integer conversion (C's `(int)x`), sign-symmetric truncation | Precision-sensitive iterative computation |
| Toward $+\infty$ | Upper bound in interval arithmetic | Standalone computation (always biased upward) |
| Toward $-\infty$ | Lower bound in interval arithmetic | Standalone computation (always biased downward) |

In interval arithmetic, Toward $+\infty$ and Toward $-\infty$ are used together to compute an interval $[\text{lower}, \text{upper}]$ that is guaranteed to contain the exact result.

§7 Summary

  • Floating-point numbers: approximate real numbers in the form $\pm m \times 2^e$
  • IEEE 754: international standard. Single precision (32-bit), double precision (64-bit)
  • Normalized numbers: implicit leading "1." (hidden bit); switches to "0." for subnormals
  • Special values: $\pm\infty$, NaN, $\pm 0$, subnormals
  • Machine epsilon: the gap between 1 and the next representable number (informally, the smallest $\varepsilon$ with $1 + \varepsilon \neq 1$)
  • Single precision: ~7 decimal digits; double precision: ~16 decimal digits