Chapter 3

Pinhole Camera Model

What is a pinhole camera?

The pinhole camera is the simplest camera model: light enters through a small hole (the pinhole) and forms an image on a screen on the opposite side. Because no lens is involved, it serves as an idealized geometric model that is widely used in computer vision (CV) and computer graphics (CG).

[Figure 1: a 3D point P(X, Y, Z) is projected through the camera center O onto the image plane as the image point p(u, v). The focal length f is measured along the optical axis; the coordinate axes are X, Y, Z.]
Figure 1. 3D-to-2D projection by the pinhole camera model.

Properties of perspective projection
  • Distant objects appear smaller
  • Parallel lines meet at a vanishing point
  • Depth information is lost
  • It is a 3D→2D projective transform

Mathematics of perspective projection

The perspective projection of a 3D point $P = (X, Y, Z)$ to an image-plane point $p = (x, y)$ follows from the geometry of similar triangles.

Note on convention: Figure 1 shows the physical pinhole geometry, with the image plane behind the camera center at $Z = -f$. For the derivations that follow, we instead use the virtual image plane model shown in Figure 2, with the image plane in front of the camera center at $Z = f$. The two models differ only in whether the image is flipped; the similar-triangle proportions are identical. The virtual image plane is the standard convention in CV/CG textbooks because it removes the sign flips and is easier to manipulate.

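[Figure 2: side view in the Y-Z plane. The point P(Y, Z), the camera center O, and the image point p(y) on the image plane at distance f form similar triangles.]
Figure 2. Similar-triangle relation for perspective projection (virtual image plane model, Y-Z plane).

$$\frac{y}{Y} = \frac{f}{Z} \quad\Longrightarrow\quad y = f \cdot \frac{Y}{Z}$$

The same relation holds in the X-Z plane.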

Perspective projection formulas

Projection of a 3D point $(X, Y, Z)$ onto the image-plane point $(x, y)$:

$$x = f \dfrac{X}{Z}, \quad y = f \dfrac{Y}{Z}$$

where $f$ is the focal length and $Z$ is the depth.
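As a quick numeric check of these formulas, the following minimal sketch projects the same point at increasing depths. The function name `project_point` and the sample values (a 50 mm focal length, a point at 1, 2, and 4 m) are illustrative assumptions, not part of the chapter's notation.

```python
def project_point(X, Y, Z, f):
    """Perspective projection onto the virtual image plane at Z = f."""
    if Z <= 0:
        raise ValueError("the point must lie in front of the camera (Z > 0)")
    return f * X / Z, f * Y / Z


f = 0.05  # focal length in meters (assumed value: 50 mm)

# Same lateral offset (X, Y), increasing depth Z.
for Z in (1.0, 2.0, 4.0):
    x, y = project_point(0.5, 0.25, Z, f)
    print(f"Z = {Z:.1f} m -> x = {x * 1000:.3f} mm, y = {y * 1000:.3f} mm")
```

Each doubling of the depth halves both image coordinates, which is the "distant objects appear smaller" property in numeric form.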

Important: Because the formulas divide by $Z$, perspective projection is not a linear transformation in the usual sense (it cannot be written as a single matrix product). However, if we use homogeneous coordinates ($(X, Y, Z) \to (X, Y, Z, 1)$, covered in the next chapter), the transformation can be split into (1) a part expressible as a matrix product and (2) a final division by $Z$ (normalization of the homogeneous coordinate). The non-linearity is confined to step (2), and step (1) lets us treat rotation, translation, and projection in a unified matrix framework.
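The two-step split can be sketched in a few lines. The 3×4 matrix `M` below is one common way to write step (1) and is an assumption made for illustration here; the exact form used in the next chapter may differ.

```python
import numpy as np

f = 0.05  # focal length in meters (assumed value)

# Step (1): a 3x4 matrix acting on the homogeneous point (X, Y, Z, 1).
# This particular layout is an illustrative assumption.
M = np.array([
    [f,   0.0, 0.0, 0.0],
    [0.0, f,   0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

P_h = np.array([0.5, 0.25, 2.0, 1.0])  # homogeneous 3D point
p_h = M @ P_h                          # (f*X, f*Y, Z): linear in P_h

# Step (2): the only non-linear operation, division by the last component.
x, y = p_h[:2] / p_h[2]
print(f"x = {x:.5f}, y = {y:.5f}")
```

Everything before the final division is an ordinary matrix product, so it can be chained with other matrices (for example rotations and translations) before normalizing once at the end.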

Field of View (FOV)

The field of view (FOV) is the angular extent of the scene that the camera captures. It is determined by the focal length and the sensor size.

Definition: Field of View (FOV)

The angular extent of the scene captured by the camera: the angle $\theta$, centered on the optical axis, between the two rays from the pinhole $O$ to opposite edges of the image plane (sensor).

Depending on whether the width $w$, height $h$, or diagonal $d$ of the sensor is used, one obtains the horizontal, vertical, or diagonal field of view (different standards and use cases prefer different ones).

Figure 3 illustrates how the FOV changes when the focal length $f$ varies.

[Figure 3: three panels. Wide (short f): small f, large θ, wide FOV. Standard: medium f, medium θ, close to human vision. Telephoto (long f): large f, small θ, narrow FOV.]
Figure 3. The sensor height is fixed and only the focal length $f$ varies. The shorter $f$ is (pinhole closer to the sensor), the wider the FOV $\theta$ becomes (half-angles 45° → 24° → 12°, roughly a 4× range).

Deriving the FOV formula

The FOV follows from a simple right triangle. Because the optical axis is perpendicular to the sensor, the pinhole $O$, the sensor center, and one edge of the sensor form a right triangle.

[Figure 4: the pinhole O lies at distance f (focal length) from the image plane (sensor); the leg along the sensor has length w/2; the half-angle θ/2 and the full angle θ are marked at O.]
Figure 4. FOV derivation geometry: from the right triangle $O$-(sensor center)-(sensor edge) we read off the half-angle $\theta/2$.

In Figure 4 the adjacent leg of the right triangle is the focal length $f$ and the opposite leg is the sensor half-width $w/2$. For the half angle $\theta/2$ we therefore have

$$\tan\!\left(\dfrac{\theta}{2}\right) = \dfrac{w/2}{f} = \dfrac{w}{2f}.$$

Taking $\arctan$ on both sides and doubling gives the full FOV $\theta$.

FOV formula

$$\boxed{\;\theta = 2\,\arctan\!\left(\dfrac{w}{2f}\right)\;}$$

Substitute the sensor width for $w$ to get the horizontal FOV, the height for the vertical FOV, or the diagonal $d$ for the diagonal FOV.

Larger $f$ ⇒ smaller $\theta$ (telephoto); smaller $f$ ⇒ larger $\theta$ (wide angle).
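To make the formula concrete, here is a small sketch that evaluates it for the width, height, and diagonal of a sensor. The helper name `fov_degrees`, the 36 mm × 24 mm sensor size, and the three focal lengths are assumptions chosen for illustration.

```python
import math


def fov_degrees(extent_mm, focal_length_mm):
    """Full field of view in degrees: theta = 2 * arctan(extent / (2 * f))."""
    return math.degrees(2.0 * math.atan(extent_mm / (2.0 * focal_length_mm)))


w, h = 36.0, 24.0     # assumed sensor width and height in mm
d = math.hypot(w, h)  # sensor diagonal in mm

for f in (24.0, 50.0, 200.0):
    print(
        f"f = {f:5.1f} mm: "
        f"horizontal {fov_degrees(w, f):5.1f}, "
        f"vertical {fov_degrees(h, f):5.1f}, "
        f"diagonal {fov_degrees(d, f):5.1f} degrees"
    )
```

For a fixed focal length the diagonal FOV is always the largest of the three, because the diagonal is the longest sensor extent.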

Principal point and skew

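In real cameras the optical axis does not always pass through the center of the image. Skew (pixel axes that are not perpendicular) is not modeled in this chapter; the corresponding entry of the matrix introduced below is zero.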

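[Figure 5: pixel axes u and v with origin (0, 0); the principal point (c_x, c_y) is marked next to the image center.]
Figure 5. Principal point and the optical axis.

The principal point is the intersection of the optical axis with the image plane. Ideally it coincides with the image center; in reality it is often offset.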

Conversion to pixel coordinates

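From normalized image coordinates $(x_n, y_n) = (X/Z,\; Y/Z)$, that is, the image-plane coordinates $x = fX/Z$, $y = fY/Z$ divided by $f$, to pixel coordinates $(u, v)$:

$$u = f_x \cdot x_n + c_x, \quad v = f_y \cdot y_n + c_y$$
  • $f_x, f_y$: focal length in pixel units ($f_x = f / p_x$ and $f_y = f / p_y$, where $p_x$ and $p_y$ are the pixel sizes along the horizontal and vertical axes)
  • $(c_x, c_y)$: pixel coordinates of the principal point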
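The conversion from physical units to pixels can be sketched as follows. The helper names and all numeric values (a 4 mm lens, 5-micrometer square pixels, a principal point at (320, 240)) are hypothetical and chosen only so the numbers come out round; the inputs to `to_pixel` are the normalized coordinates $X/Z$ and $Y/Z$.

```python
def focal_length_in_pixels(f_mm, pixel_size_mm):
    """Convert a focal length in mm to pixel units: f / p."""
    return f_mm / pixel_size_mm


def to_pixel(x_n, y_n, fx, fy, cx, cy):
    """Map normalized coordinates (X/Z, Y/Z) to pixel coordinates (u, v)."""
    return fx * x_n + cx, fy * y_n + cy


f_mm = 4.0             # assumed lens focal length in mm
pixel_size_mm = 0.005  # assumed square pixels, 5 micrometers on a side

fx = fy = focal_length_in_pixels(f_mm, pixel_size_mm)  # about 800 pixels
cx, cy = 320.0, 240.0  # assumed principal point

u, v = to_pixel(0.2, 0.1, fx, fy, cx, cy)
print(f"fx = {fx:.1f} px, pixel = ({u:.1f}, {v:.1f})")
```

Note that only the ratio $f / p_x$ matters: a longer lens on a sensor with proportionally larger pixels gives the same $f_x$.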

Combined camera model

Combining perspective projection with the pixel coordinate transform:

$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \dfrac{1}{Z} \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$$

The matrix product on the right evaluates to $(f_x X + c_x Z,\; f_y Y + c_y Z,\; Z)^\top$; multiplying by $1/Z$ (normalizing the homogeneous coordinate) sets the third component to 1. Reading off the first two rows gives the scalar form:

$$u = f_x \dfrac{X}{Z} + c_x, \quad v = f_y \dfrac{Y}{Z} + c_y$$

This matrix is called the camera's intrinsic matrix $K$; we treat it in detail in the next chapter.
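The matrix form extends naturally to many points at once. The sketch below is one possible vectorized version; the function name `project_points` and the sample intrinsics are assumptions for illustration. It also shows the loss of depth information: two points on the same ray through the camera center land on the same pixel.

```python
import numpy as np


def project_points(K, points_cam):
    """Project an (N, 3) array of camera-frame points to (N, 2) pixel coordinates."""
    points_cam = np.asarray(points_cam, dtype=float)
    if np.any(points_cam[:, 2] <= 0):
        raise ValueError("all points must lie in front of the camera (Z > 0)")
    # Each row becomes (fx*X + cx*Z, fy*Y + cy*Z, Z).
    homogeneous = points_cam @ K.T
    # Divide by the third component to normalize the homogeneous coordinate.
    return homogeneous[:, :2] / homogeneous[:, 2:3]


K = np.array([
    [800.0,   0.0, 320.0],
    [  0.0, 800.0, 240.0],
    [  0.0,   0.0,   1.0],
])

points = np.array([
    [1.0, 0.5,  5.0],
    [2.0, 1.0, 10.0],  # same ray from the camera center, twice as far away
])
pixels = project_points(K, points)
print(pixels)
```

Because both points differ only by a common scale factor, the division by $Z$ cancels that factor and the pixel coordinates coincide, so depth cannot be recovered from a single image point.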

Code example (Python)

import numpy as np

# Camera parameters
fx, fy = 800, 800  # focal length (pixel units)
cx, cy = 320, 240  # principal point
f = 50             # lens focal length (mm); used only for the FOV below,
                   # not tied to fx, fy in this example

# 3D point (camera coordinates)
P = np.array([1.0, 0.5, 5.0])  # (X, Y, Z) in meters

# Perspective projection to normalized image coordinates (division by depth)
x = P[0] / P[2]  # X/Z
y = P[1] / P[2]  # Y/Z

# Convert to pixel coordinates
u = fx * x + cx
v = fy * y + cy
print(f"Pixel coordinates: ({u:.1f}, {v:.1f})")

# Field of view
sensor_width = 36  # mm (full-frame sensor)
fov = 2 * np.arctan(sensor_width / (2 * f))
print(f"Field of View: {np.degrees(fov):.1f}°")

# Matrix-form projection
K = np.array([
    [fx, 0, cx],
    [0, fy, cy],
    [0,  0,  1]
])

# K maps the camera-frame point (X, Y, Z) to homogeneous pixel coordinates
p_homogeneous = K @ P
# Normalize by the third component (the depth Z) to obtain (u, v)
p = p_homogeneous[:2] / p_homogeneous[2]
print(f"Pixel (matrix): ({p[0]:.1f}, {p[1]:.1f})")

Summary

  • The pinhole camera is the idealized model of perspective projection.
  • Perspective projection: $x = fX/Z$, $y = fY/Z$ (a non-linear transform).
  • Smaller focal length ⇒ wider field of view.
  • The principal point is the intersection of the optical axis and the image plane (often offset from the image center).
  • The intrinsic matrix $K$ packages the projection in matrix form.