Pinhole Camera Model
What is a pinhole camera?
The pinhole camera is the simplest camera model: light enters through a small hole (the pinhole) and forms an image on a screen on the opposite side. Because no lens is involved, it serves as an idealized geometric model that is widely used in computer vision (CV) and computer graphics (CG).
Mathematics of perspective projection
The perspective projection of a 3D point $P = (X, Y, Z)$ to an image-plane point $p = (x, y)$ follows from the geometry of similar triangles.
Note on convention: Figure 1 shows the physical pinhole geometry (image plane behind the pinhole, at $Z = -f$). For the derivations that follow, we instead use the virtual image plane model (image plane in front of the pinhole, at $Z = f$) shown in Figure 2. The two models differ only in whether the image is flipped; the similar-triangle proportions are identical. The virtual image plane is the standard convention in CV/CG textbooks because it removes the sign flips and is easier to manipulate.
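The similar-triangle argument can be written out explicitly for the virtual image plane. The ray from the pinhole $O$ through $P = (X, Y, Z)$ meets the plane $Z = f$ at $(x, y, f)$, so the triangle with legs $x$ and $f$ is similar to the triangle with legs $X$ and $Z$ (and likewise for $y$). A short sketch of the derivation, assuming $Z > 0$:

$$\frac{x}{f} = \frac{X}{Z} \;\Longrightarrow\; x = f\,\frac{X}{Z}, \qquad \frac{y}{f} = \frac{Y}{Z} \;\Longrightarrow\; y = f\,\frac{Y}{Z}$$

For the physical image plane at $Z = -f$ the same proportions hold with a sign flip, $x = -fX/Z$ and $y = -fY/Z$, which is the inverted image.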
Perspective projection formulas
Projection of a 3D point $(X, Y, Z)$ onto the image-plane point $(x, y)$:
$$x = f \dfrac{X}{Z}, \quad y = f \dfrac{Y}{Z}$$
where $f$ is the focal length and $Z$ is the depth.
Important: Because the formulas divide by $Z$, perspective projection is not a linear transformation in the usual sense (it cannot be written as a single matrix product). However, if we use homogeneous coordinates ($(X, Y, Z) \to (X, Y, Z, 1)$, covered in the next chapter), the transformation can be split into (1) a part expressible as a matrix product and (2) a final division by $Z$ (normalization of the homogeneous coordinate). The non-linearity is confined to step (2), and step (1) lets us treat rotation, translation, and projection in a unified matrix framework.
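The two-step split described above can be sketched in NumPy. This is a minimal illustration, not a definitive implementation: the focal length `f = 1.0`, the example point, and the names `M`, `P_h`, and `q` are assumptions chosen here and do not come from the text.

```python
import numpy as np

f = 1.0  # focal length (assumed value, arbitrary units)

# Step (1): a 3x4 matrix acting on homogeneous coordinates (X, Y, Z, 1).
M = np.array([
    [f,   0.0, 0.0, 0.0],
    [0.0, f,   0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

P_h = np.array([2.0, 1.0, 4.0, 1.0])  # homogeneous 3D point (assumed example)
q = M @ P_h                           # (f*X, f*Y, Z): still linear in P_h

# Step (2): the only non-linear part, division by the third component (Z).
x, y = q[:2] / q[2]

print(x, y)  # x = f*X/Z, y = f*Y/Z
```

Because step (1) is an ordinary matrix product, it can be chained with rotation and translation matrices before the single division in step (2).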
Field of View (FOV)
The field of view (FOV) is the angular extent of the scene that the camera captures. It is determined by the focal length and the sensor size.
Definition: Field of View (FOV)
The maximum angle of the scene captured by the camera. Centered on the optical axis, $\theta$ is the angle between the two rays from the pinhole $O$ to the two ends of the image plane (sensor).
Depending on whether the width $w$, height $h$, or diagonal $d$ of the sensor is used, one obtains the horizontal, vertical, or diagonal field of view (different standards and use cases prefer different ones).
Figure 3 illustrates how the FOV changes when the focal length $f$ varies.
Deriving the FOV formula
The FOV follows from a simple right triangle. Because the optical axis is perpendicular to the sensor, the pinhole $O$, the sensor center, and one edge of the sensor (at distance $w/2$ from the center) form a right triangle with the right angle at the sensor center.
In Figure 4 the adjacent leg of the right triangle is the focal length $f$ and the opposite leg is the sensor half-width $w/2$. For the half angle $\theta/2$ we therefore have
$$\tan\!\left(\dfrac{\theta}{2}\right) = \dfrac{w/2}{f} = \dfrac{w}{2f}.$$
Taking $\arctan$ on both sides and doubling gives the full FOV $\theta$.
FOV formula
$$\boxed{\;\theta = 2\,\arctan\!\left(\dfrac{w}{2f}\right)\;}$$
Substitute the sensor width for $w$ to get the horizontal FOV, the height $h$ for the vertical FOV, or the diagonal $d$ for the diagonal FOV.
Larger $f$ ⇒ smaller $\theta$ (telephoto); smaller $f$ ⇒ larger $\theta$ (wide angle).
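The relationship between focal length and FOV can be checked numerically. The sketch below applies the formula above to an assumed 36 mm × 24 mm sensor and a few assumed focal lengths; the helper name `fov_degrees` is hypothetical.

```python
import math

def fov_degrees(sensor_extent_mm: float, focal_length_mm: float) -> float:
    """Full field of view in degrees: theta = 2 * arctan(extent / (2 f))."""
    return math.degrees(2.0 * math.atan(sensor_extent_mm / (2.0 * focal_length_mm)))

# Assumed sensor: 36 mm x 24 mm; the diagonal follows from Pythagoras.
w, h = 36.0, 24.0
d = math.hypot(w, h)

for f in (24.0, 50.0, 200.0):  # assumed focal lengths in mm
    print(f"f = {f:5.1f} mm: "
          f"horizontal {fov_degrees(w, f):5.1f} deg, "
          f"vertical {fov_degrees(h, f):5.1f} deg, "
          f"diagonal {fov_degrees(d, f):5.1f} deg")
```

A useful sanity check: when $f = w/2$ the half angle is $45°$, so the full FOV is exactly $90°$.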
Principal point and skew
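In real cameras the optical axis does not always pass through the center of the image. The point where it meets the image plane is the principal point, and its pixel coordinates $(c_x, c_y)$ enter the conversion below as an offset. Skew (non-perpendicular pixel axes) is assumed to be zero throughout this section.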
Conversion to pixel coordinates
From normalized image coordinates $(x, y) = (X/Z,\; Y/Z)$ (the projection with $f = 1$) to pixel coordinates $(u, v)$:
$$u = f_x \cdot x + c_x, \quad v = f_y \cdot y + c_y$$
- $f_x, f_y$: focal length in pixel units ($f_x = f / p_x$, $f_y = f / p_y$, where $p_x$ and $p_y$ are the pixel width and height)
- $(c_x, c_y)$: pixel coordinates of the principal point
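As a worked example of the unit conversion, the sketch below derives $f_x$ from a focal length in millimeters and a pixel size. All numbers (50 mm lens, 36 mm sensor width, 640-pixel image width) are assumed for illustration, and the pixel size is taken to be the sensor width divided by the image width.

```python
import math

f_mm = 50.0             # focal length in mm (assumed)
sensor_width_mm = 36.0  # sensor width in mm (assumed)
image_width_px = 640    # image width in pixels (assumed)

p_x = sensor_width_mm / image_width_px  # pixel size in mm per pixel
f_x = f_mm / p_x                        # focal length in pixel units

# Cross-check against the FOV formula: half the image width subtends theta/2,
# so f_x = (image_width_px / 2) / tan(theta / 2).
theta = 2.0 * math.atan(sensor_width_mm / (2.0 * f_mm))
f_x_from_fov = (image_width_px / 2.0) / math.tan(theta / 2.0)

print(f"p_x = {p_x:.5f} mm/px, f_x = {f_x:.1f} px, from FOV: {f_x_from_fov:.1f} px")
```

Both routes give the same $f_x$, which shows that the focal length in pixels encodes the same information as the horizontal FOV together with the image width.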
Combined camera model
Combining perspective projection with the pixel coordinate transform:
$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \dfrac{1}{Z} \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$$
The matrix product on the right evaluates to $(f_x X + c_x Z,\; f_y Y + c_y Z,\; Z)^\top$; multiplying by $1/Z$ (normalizing the homogeneous coordinate) sets the third component to 1. Reading off the first two rows gives the scalar form:
$$u = f_x \dfrac{X}{Z} + c_x, \quad v = f_y \dfrac{Y}{Z} + c_y$$
This matrix is called the camera's intrinsic matrix $K$; we treat it in detail in the next chapter.
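The combined model extends naturally to many points at once. The following is a minimal sketch under stated assumptions: the intrinsic values are illustrative, the helper name `project_points` is hypothetical, and the check that every point has $Z > 0$ is an added safeguard rather than part of the model above.

```python
import numpy as np

# Assumed intrinsics (pixel units); values are illustrative only.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([
    [fx, 0.0, cx],
    [0.0, fy, cy],
    [0.0, 0.0, 1.0],
])

def project_points(K: np.ndarray, points_cam: np.ndarray) -> np.ndarray:
    """Project an (N, 3) array of camera-frame points to (N, 2) pixel coordinates."""
    if np.any(points_cam[:, 2] <= 0):
        raise ValueError("all points must have Z > 0 (in front of the camera)")
    uvw = points_cam @ K.T           # rows are (fx*X + cx*Z, fy*Y + cy*Z, Z)
    return uvw[:, :2] / uvw[:, 2:3]  # divide by Z to normalize

points = np.array([
    [0.0, 0.0, 2.0],   # on the optical axis: lands on the principal point
    [1.0, 0.5, 5.0],
    [2.0, 1.0, 10.0],  # same viewing ray as the previous point
])
pixels = project_points(K, points)
print(pixels)
```

The last two points lie on the same ray through the pinhole, so they project to the same pixel; this is the depth ambiguity introduced by the division by $Z$.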
Code example (Python)
import numpy as np

# Camera parameters
fx, fy = 800, 800  # focal length (pixel units)
cx, cy = 320, 240  # principal point (pixels)

# 3D point (camera coordinates)
P = np.array([1.0, 0.5, 5.0])  # (X, Y, Z) in meters

# Perspective projection to normalized image coordinates
x = P[0] / P[2]  # X/Z
y = P[1] / P[2]  # Y/Z

# Convert to pixel coordinates
u = fx * x + cx
v = fy * y + cy
print(f"Pixel coordinates: ({u:.1f}, {v:.1f})")  # (480.0, 320.0)

# Field of view (uses the physical focal length in mm; not tied to fx, fy above)
f = 50             # focal length (mm)
sensor_width = 36  # mm (full-frame sensor)
fov = 2 * np.arctan(sensor_width / (2 * f))
print(f"Field of View: {np.degrees(fov):.1f}°")  # 39.6°

# Matrix-form projection
K = np.array([
    [fx, 0, cx],
    [0, fy, cy],
    [0, 0, 1],
])
p_homogeneous = K @ P  # (fx*X + cx*Z, fy*Y + cy*Z, Z)
p = p_homogeneous[:2] / p_homogeneous[2]  # divide by Z to normalize
print(f"Pixel (matrix): ({p[0]:.1f}, {p[1]:.1f})")  # (480.0, 320.0)
Summary
- The pinhole camera is the idealized model of perspective projection.
- Perspective projection: $x = fX/Z$, $y = fY/Z$ (a non-linear transform).
- Smaller focal length ⇒ wider field of view.
- The principal point is the intersection of the optical axis and the image plane (often offset from the image center).
- The intrinsic matrix $K$ packages the projection in matrix form.