Unconstrained Optimization — Gradient and Quasi-Newton Methods
Overview
Unconstrained optimisation $\min_{x \in \mathbb{R}^n} f(x)$ sits at the core of machine learning, signal processing,
physics simulation, and statistical inference. sangi ships algorithms from 1D up to large-scale ($n \gtrsim 10^4$) in
optimization_1d.hpp / optimization_nd.hpp.
This article explains the theory and the implementation choices behind each.
Constrained optimisation, nonlinear least squares, and linear / convex programming live in Optimization-constrained-lp; derivative-free and metaheuristic methods in Optimization-derivative-free.
1D minimisation (Brent and golden section)
Minimising $f: \mathbb{R} \to \mathbb{R}$ over an interval $[a, b]$ is the canonical derivative-free problem.
Golden section search
Compare two interior points $x_1 < x_2$, discard the worse end, and shrink the interval by the golden ratio $\varphi = (1 + \sqrt{5})/2$. Convergence is linear (the interval scales by $1/\varphi \approx 0.618$ each step), and the information-per-evaluation is provably optimal under unimodality. The required number of evaluations is $O(\log(1/\varepsilon))$.
Parabolic interpolation
Fit a parabola through $(x_1, f_1), (x_2, f_2), (x_3, f_3)$ and use its vertex as the next trial. On smooth functions this attains superlinear convergence (order $\approx 1.324$), but it can fail when the parabola is ill-defined.
Brent's method
Brent (1973) hybridises parabolic interpolation with golden section.
- Accept parabolic interpolation when the vertex lies in the interval, the step is at most half the previous interval width, and the iterates remain bounded.
- Otherwise fall back to one golden-section step.
On smooth $f$ Brent's method needs 5–10x fewer evaluations than golden section, and on non-smooth $f$ it still
delivers the worst-case linear convergence. brent_minimize is the standard 1D choice in sangi.
See also: Univariate optimisation / Foundations of optimisation
Line search (Armijo, Wolfe)
Multidimensional methods walk along a descent direction $p_k$ from the current point $x_k$: $\alpha_k = \arg\min_{\alpha > 0} \phi(\alpha)$, $\phi(\alpha) = f(x_k + \alpha p_k)$. Exact minimisation is wasteful; inexact line search picks an $\alpha$ that satisfies acceptance conditions.
Armijo (sufficient decrease)
$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \, \nabla f(x_k)^T p_k, \qquad c_1 \in (0, 1).$$
The function must decrease by at least a fraction of the linear model. Typical $c_1 = 10^{-4}$. Backtracking halves (more generally scales by $\rho \in (0, 1)$) an initial $\alpha$ until Armijo holds.
Wolfe
Adds a curvature condition on top of Armijo:
$$\nabla f(x_k + \alpha p_k)^T p_k \geq c_2 \, \nabla f(x_k)^T p_k, \qquad c_2 \in (c_1, 1).$$
Typical $c_2 = 0.9$ for Newton/BFGS, $c_2 = 0.1$ for CG. The condition prevents excessively small steps. BFGS positive-definiteness proofs require the (strong) Wolfe condition.
sangi uses Armijo backtracking through the LineSearchParams
struct, shared by all gradient-based methods.
See also: Line search
Steepest descent
Step along $p_k = -\nabla f(x_k)$: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$ with $\alpha_k$ from line search.
Convergence is linear; for a quadratic with condition number $\kappa$ the rate is $\bigl(\frac{\kappa - 1}{\kappa + 1}\bigr)^2$, so ill conditioning causes severe zigzagging. This is the deterministic full-gradient method, distinct from SGD or its momentum variants in deep learning.
sangi steepest_descent is provided as a baseline; production code should reach for BFGS, L-BFGS, or CG.
See also: Gradient descent
Nonlinear conjugate gradient (FR, PR+)
Linear CG (Hestenes-Stiefel 1952) solves symmetric positive definite systems $Ax = b$ in at most $n$ steps. Fletcher-Reeves (1964) and Polak-Ribière (1969) extended the idea to nonlinear optimisation by enforcing conjugacy with the previous search direction:
$$p_{k+1} = -g_{k+1} + \beta_{k+1} \, p_k, \qquad g_k = \nabla f(x_k).$$
| Variant | $\beta_{k+1}$ |
|---|---|
| Fletcher-Reeves | $\beta_{k+1}^{FR} = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$ |
| Polak-Ribière | $\beta_{k+1}^{PR} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}$ |
| Polak-Ribière+ | $\beta_{k+1}^{PR+} = \max(0, \beta_{k+1}^{PR})$ |
PR+ is typically faster than FR in practice but needs a strong-Wolfe line search to ensure descent.
The sangi conjugateGradientMinimize defaults to PR+ and resets via $p \leftarrow -g$ every $n$ steps.
$O(n)$ memory makes CG attractive for very large problems, but on ill-conditioned objectives L-BFGS usually wins.
Newton and modified Newton
Newton's method minimises the second-order Taylor model explicitly:
$$f(x_k + p) \approx f(x_k) + g_k^T p + \tfrac{1}{2} p^T H_k p \;\Longrightarrow\; p_k^{\text{Newton}} = -H_k^{-1} g_k.$$
If $H_k$ is symmetric positive definite (SPD), $p_k^{\text{Newton}}$ is a descent direction and convergence is quadratic.
When the Hessian is not SPD
Non-convex objectives and saddle neighbourhoods can produce indefinite or singular $H_k$. Remedies:
- Modified Cholesky (Gill-Murray): bump up diagonals during factorisation to enforce SPD.
- Levenberg-Marquardt damping: solve $(H + \lambda I) p = -g$.
- Trust region: trust the quadratic model only within a radius (next section).
- Truncated Newton (Newton-CG): solve $H p = -g$ inexactly with CG; terminate on negative curvature.
Newton-CG
sangi newton_cg_minimize is a truncated Newton method that solves $H p = -g$ via inner CG.
It never stores $H$ explicitly; the user provides the Hessian-vector product $H(x) v$.
Effective when $n$ is large and $H$ is sparse or structured.
See also: Newton's method
BFGS and SR1
Quasi-Newton methods build an approximate Hessian (or its inverse) from gradient differences $y_k = g_{k+1} - g_k$ and step differences $s_k = x_{k+1} - x_k$ without computing $\nabla^2 f$ directly.
BFGS update (Broyden-Fletcher-Goldfarb-Shanno, 1970)
Rank-2 update of $B_k \approx H_k^{-1}$:
$$B_{k+1} = \Bigl(I - \frac{s_k y_k^T}{y_k^T s_k}\Bigr) B_k \Bigl(I - \frac{y_k s_k^T}{y_k^T s_k}\Bigr) + \frac{s_k s_k^T}{y_k^T s_k}.$$
Properties:
- If $B_0$ is SPD and Wolfe holds, every $B_{k+1}$ stays SPD.
- Superlinear convergence (Powell 1976).
- $O(n^2)$ memory and $O(n^2)$ update cost.
BFGS is the de facto standard for medium-scale problems. sangi bfgs_minimize implements it.
SR1 (Symmetric Rank-1)
Update $B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$. The update is symmetric but loses positive definiteness; the approximation can be more accurate on non-convex problems and is usually paired with a trust-region framework.
See also: Quasi-Newton methods
L-BFGS — limited memory
Nocedal (1980) replaced the dense $B_k$ of BFGS with $m$ stored pairs $(s_i, y_i)$ and computed $B_k g_k$ with a two-loop recursion from a seed $B_0$ (usually $\gamma_k I$).
Two-loop recursion
q = g_k
for i = k-1 .. k-m:
α_i = (s_i^T q) / (y_i^T s_i)
q -= α_i * y_i
r = γ_k * q # γ_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1})
for i = k-m .. k-1:
β = (y_i^T r) / (y_i^T s_i)
r += (α_i - β) * s_i
return r # = -B_k g_k
Typical $5 \leq m \leq 20$. Memory is $O(mn)$ and one iteration costs $O(mn)$, so problems with $n = 10^6$ are routine. L-BFGS dominates large-scale machine learning, physics simulation, and tomography.
sangi lbfgsMinimize implements the two-loop recursion; memory_size (default 10) controls $m$.
Trust region (LM, dogleg, Steihaug-CG)
Line search picks a direction, then a step. Trust region picks a step radius and then the best step inside it. Each iteration solves the sub-problem
$$\min_{p} \;m_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \quad \text{subject to} \quad \|p\| \leq \Delta_k.$$
The radius $\Delta_k$ is updated based on the ratio $\rho_k = \dfrac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$:
- $\rho_k > 0.75 \Rightarrow \Delta_{k+1} \leftarrow 2 \Delta_k$ (expand).
- $\rho_k < 0.25 \Rightarrow \Delta_{k+1} \leftarrow \Delta_k / 4$ (shrink).
- $\rho_k > 0.1 \Rightarrow x_{k+1} \leftarrow x_k + p_k$ (accept); otherwise reject.
Approximate sub-problem solvers
- Cauchy point: walk along the steepest descent direction until the radius. Always defined, slow convergence.
- Dogleg: line-segment between the Cauchy point and the Newton step, take the intersection with the radius. Powell (1970). Requires SPD $B_k$.
- Steihaug-CG: run CG on $B_k p = -g_k$, terminate at the trust boundary or when negative curvature is detected. Works for indefinite $B_k$ and large sparse Hessians.
- Levenberg-Marquardt: $(B_k + \lambda I) p = -g$, the standard for nonlinear least squares. See Optimization-constrained-lp.
Trust-region methods stay stable for indefinite Hessians, which is why non-convex and ill-conditioned problems
prefer them. sangi applies a Levenberg-Marquardt-style trust region inside gaussNewtonMinimize and
leastSquaresFit.
See also: Trust-region methods
Algorithm selection guide
| Situation | Recommended |
|---|---|
| 1D, smooth | brent_minimize |
| 1D, non-smooth | golden_section_search |
| $n \leq 100$, gradient available | bfgs_minimize |
| $n \leq 10^3$, Hessian-vector available | newton_cg_minimize |
| $n \geq 10^3$, gradient only, ill conditioned | lbfgsMinimize |
| $n \geq 10^3$, tight memory | conjugateGradientMinimize |
| Nonlinear least squares $\sum r_i(x)^2$ | gaussNewtonMinimize / leastSquaresFit |
| Costly or noisy gradients | nelderMeadMinimize etc. (derivative-free) |
References
- Nocedal, J. and Wright, S. J. (2006). Numerical Optimization, 2nd ed. Springer.
- Fletcher, R. (1987). Practical Methods of Optimization, 2nd ed. Wiley.
- Brent, R. P. (1973). Algorithms for Minimization without Derivatives. Prentice-Hall.
- Liu, D. C. and Nocedal, J. (1989). "On the limited memory BFGS method for large scale optimization". Math. Programming, 45, 503–528.
- Steihaug, T. (1983). "The Conjugate Gradient Method and Trust Regions in Large Scale Optimization". SIAM J. Numer. Anal., 20, 626–637.