Unconstrained Optimization — Gradient and Quasi-Newton Methods

Overview

Unconstrained optimisation $\min_{x \in \mathbb{R}^n} f(x)$ sits at the core of machine learning, signal processing, physics simulation, and statistical inference. sangi ships algorithms from 1D up to large-scale ($n \gtrsim 10^4$) in optimization_1d.hpp / optimization_nd.hpp. This article explains the theory and the implementation choices behind each.

Constrained optimisation, nonlinear least squares, and linear / convex programming live in Optimization-constrained-lp; derivative-free and metaheuristic methods in Optimization-derivative-free.

1D minimisation (Brent and golden section)

Minimising $f: \mathbb{R} \to \mathbb{R}$ over an interval $[a, b]$ is the canonical derivative-free problem.

Golden section search

Compare two interior points $x_1 < x_2$, discard the worse end, and shrink the interval by the golden ratio $\varphi = (1 + \sqrt{5})/2$. Convergence is linear (the interval scales by $1/\varphi \approx 0.618$ each step), and the information-per-evaluation is provably optimal under unimodality. The required number of evaluations is $O(\log(1/\varepsilon))$.

Parabolic interpolation

Fit a parabola through $(x_1, f_1), (x_2, f_2), (x_3, f_3)$ and use its vertex as the next trial. On smooth functions this attains superlinear convergence (order $\approx 1.324$), but it can fail when the parabola is ill-defined.

Brent's method

Brent (1973) hybridises parabolic interpolation with golden section.

  • Accept parabolic interpolation when the vertex lies in the interval, the step is at most half the previous interval width, and the iterates remain bounded.
  • Otherwise fall back to one golden-section step.

On smooth $f$ Brent's method needs 5–10x fewer evaluations than golden section, and on non-smooth $f$ it still delivers the worst-case linear convergence. brent_minimize is the standard 1D choice in sangi.

Steepest descent

Step along $p_k = -\nabla f(x_k)$: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$ with $\alpha_k$ from line search.

Convergence is linear; for a quadratic with condition number $\kappa$ the rate is $\bigl(\frac{\kappa - 1}{\kappa + 1}\bigr)^2$, so ill conditioning causes severe zigzagging. This is the deterministic full-gradient method, distinct from SGD or its momentum variants in deep learning.

sangi steepest_descent is provided as a baseline; production code should reach for BFGS, L-BFGS, or CG.

Nonlinear conjugate gradient (FR, PR+)

Linear CG (Hestenes-Stiefel 1952) solves symmetric positive definite systems $Ax = b$ in at most $n$ steps. Fletcher-Reeves (1964) and Polak-Ribière (1969) extended the idea to nonlinear optimisation by enforcing conjugacy with the previous search direction:

$$p_{k+1} = -g_{k+1} + \beta_{k+1} \, p_k, \qquad g_k = \nabla f(x_k).$$

Variant$\beta_{k+1}$
Fletcher-Reeves$\beta_{k+1}^{FR} = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$
Polak-Ribière$\beta_{k+1}^{PR} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}$
Polak-Ribière+$\beta_{k+1}^{PR+} = \max(0, \beta_{k+1}^{PR})$

PR+ is typically faster than FR in practice but needs a strong-Wolfe line search to ensure descent. The sangi conjugateGradientMinimize defaults to PR+ and resets via $p \leftarrow -g$ every $n$ steps.

$O(n)$ memory makes CG attractive for very large problems, but on ill-conditioned objectives L-BFGS usually wins.

Newton and modified Newton

Newton's method minimises the second-order Taylor model explicitly:

$$f(x_k + p) \approx f(x_k) + g_k^T p + \tfrac{1}{2} p^T H_k p \;\Longrightarrow\; p_k^{\text{Newton}} = -H_k^{-1} g_k.$$

If $H_k$ is symmetric positive definite (SPD), $p_k^{\text{Newton}}$ is a descent direction and convergence is quadratic.

When the Hessian is not SPD

Non-convex objectives and saddle neighbourhoods can produce indefinite or singular $H_k$. Remedies:

  • Modified Cholesky (Gill-Murray): bump up diagonals during factorisation to enforce SPD.
  • Levenberg-Marquardt damping: solve $(H + \lambda I) p = -g$.
  • Trust region: trust the quadratic model only within a radius (next section).
  • Truncated Newton (Newton-CG): solve $H p = -g$ inexactly with CG; terminate on negative curvature.

Newton-CG

sangi newton_cg_minimize is a truncated Newton method that solves $H p = -g$ via inner CG. It never stores $H$ explicitly; the user provides the Hessian-vector product $H(x) v$. Effective when $n$ is large and $H$ is sparse or structured.

See also: Newton's method

BFGS and SR1

Quasi-Newton methods build an approximate Hessian (or its inverse) from gradient differences $y_k = g_{k+1} - g_k$ and step differences $s_k = x_{k+1} - x_k$ without computing $\nabla^2 f$ directly.

BFGS update (Broyden-Fletcher-Goldfarb-Shanno, 1970)

Rank-2 update of $B_k \approx H_k^{-1}$:

$$B_{k+1} = \Bigl(I - \frac{s_k y_k^T}{y_k^T s_k}\Bigr) B_k \Bigl(I - \frac{y_k s_k^T}{y_k^T s_k}\Bigr) + \frac{s_k s_k^T}{y_k^T s_k}.$$

Properties:

  • If $B_0$ is SPD and Wolfe holds, every $B_{k+1}$ stays SPD.
  • Superlinear convergence (Powell 1976).
  • $O(n^2)$ memory and $O(n^2)$ update cost.

BFGS is the de facto standard for medium-scale problems. sangi bfgs_minimize implements it.

SR1 (Symmetric Rank-1)

Update $B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$. The update is symmetric but loses positive definiteness; the approximation can be more accurate on non-convex problems and is usually paired with a trust-region framework.

L-BFGS — limited memory

Nocedal (1980) replaced the dense $B_k$ of BFGS with $m$ stored pairs $(s_i, y_i)$ and computed $B_k g_k$ with a two-loop recursion from a seed $B_0$ (usually $\gamma_k I$).

Two-loop recursion

q = g_k
for i = k-1 .. k-m:
    α_i = (s_i^T q) / (y_i^T s_i)
    q -= α_i * y_i
r = γ_k * q                  # γ_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1})
for i = k-m .. k-1:
    β = (y_i^T r) / (y_i^T s_i)
    r += (α_i - β) * s_i
return r                     # = -B_k g_k

Typical $5 \leq m \leq 20$. Memory is $O(mn)$ and one iteration costs $O(mn)$, so problems with $n = 10^6$ are routine. L-BFGS dominates large-scale machine learning, physics simulation, and tomography.

sangi lbfgsMinimize implements the two-loop recursion; memory_size (default 10) controls $m$.

Trust region (LM, dogleg, Steihaug-CG)

Line search picks a direction, then a step. Trust region picks a step radius and then the best step inside it. Each iteration solves the sub-problem

$$\min_{p} \;m_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \quad \text{subject to} \quad \|p\| \leq \Delta_k.$$

The radius $\Delta_k$ is updated based on the ratio $\rho_k = \dfrac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$:

  • $\rho_k > 0.75 \Rightarrow \Delta_{k+1} \leftarrow 2 \Delta_k$ (expand).
  • $\rho_k < 0.25 \Rightarrow \Delta_{k+1} \leftarrow \Delta_k / 4$ (shrink).
  • $\rho_k > 0.1 \Rightarrow x_{k+1} \leftarrow x_k + p_k$ (accept); otherwise reject.

Approximate sub-problem solvers

  • Cauchy point: walk along the steepest descent direction until the radius. Always defined, slow convergence.
  • Dogleg: line-segment between the Cauchy point and the Newton step, take the intersection with the radius. Powell (1970). Requires SPD $B_k$.
  • Steihaug-CG: run CG on $B_k p = -g_k$, terminate at the trust boundary or when negative curvature is detected. Works for indefinite $B_k$ and large sparse Hessians.
  • Levenberg-Marquardt: $(B_k + \lambda I) p = -g$, the standard for nonlinear least squares. See Optimization-constrained-lp.

Trust-region methods stay stable for indefinite Hessians, which is why non-convex and ill-conditioned problems prefer them. sangi applies a Levenberg-Marquardt-style trust region inside gaussNewtonMinimize and leastSquaresFit.

Algorithm selection guide

SituationRecommended
1D, smoothbrent_minimize
1D, non-smoothgolden_section_search
$n \leq 100$, gradient availablebfgs_minimize
$n \leq 10^3$, Hessian-vector availablenewton_cg_minimize
$n \geq 10^3$, gradient only, ill conditionedlbfgsMinimize
$n \geq 10^3$, tight memoryconjugateGradientMinimize
Nonlinear least squares $\sum r_i(x)^2$gaussNewtonMinimize / leastSquaresFit
Costly or noisy gradientsnelderMeadMinimize etc. (derivative-free)

References

  • Nocedal, J. and Wright, S. J. (2006). Numerical Optimization, 2nd ed. Springer.
  • Fletcher, R. (1987). Practical Methods of Optimization, 2nd ed. Wiley.
  • Brent, R. P. (1973). Algorithms for Minimization without Derivatives. Prentice-Hall.
  • Liu, D. C. and Nocedal, J. (1989). "On the limited memory BFGS method for large scale optimization". Math. Programming, 45, 503–528.
  • Steihaug, T. (1983). "The Conjugate Gradient Method and Trust Regions in Large Scale Optimization". SIAM J. Numer. Anal., 20, 626–637.