Is Brent's method faster than golden section search?

On smooth functions, yes. Brent attempts superlinear convergence via parabolic interpolation and falls back to golden section when interpolation misbehaves. It guarantees the worst-case linear convergence of golden section while typically using 5 to 10 times fewer function evaluations.

When should I pick BFGS over L-BFGS?

BFGS for medium problems (up to a few hundred variables) where O(n²) memory is acceptable. L-BFGS for thousands to millions of variables, since it stores only m past gradient differences (typically 5 to 20) and uses O(mn) memory.

Which is faster, conjugate gradient or L-BFGS?

Problem dependent. CG uses only O(n) memory and is the lightest weight, but converges slowly on ill-conditioned problems. L-BFGS builds a quasi-Hessian from m history vectors and handles bad conditioning much better, which is why it has become the practical default for large-scale optimisation.

Are trust-region methods better than line search?

It depends. Trust-region methods exactly minimise a quadratic model inside the trust radius and stay stable when the Hessian is indefinite. Line search relies on the search direction being a descent direction, which requires modifying the Hessian if it is indefinite. Trust region is preferred in nonlinear least squares (Levenberg-Marquardt) and physics simulations.

What is the difference between the Armijo and Wolfe conditions?

Armijo (sufficient decrease) requires only f(x + αp) ≤ f(x) + c₁ α gᵀp. Wolfe adds a curvature condition |∇f(x + αp)ᵀp| ≤ c₂ |gᵀp| that prevents steps from being too small. The Wolfe condition is needed to prove that BFGS preserves positive definiteness.

Unconstrained Optimization — Gradient and Quasi-Newton Methods

Overview

Unconstrained optimisation $\min_{x \in \mathbb{R}^n} f(x)$ sits at the core of machine learning, signal processing, physics simulation, and statistical inference. sangi ships algorithms from 1D up to large-scale ($n \gtrsim 10^4$) in optimization_1d.hpp / optimization_nd.hpp. This article explains the theory and the implementation choices behind each.

Constrained optimisation, nonlinear least squares, and linear / convex programming live in Optimization-constrained-lp; derivative-free and metaheuristic methods in Optimization-derivative-free.

1D minimisation (Brent and golden section)

Minimising $f: \mathbb{R} \to \mathbb{R}$ over an interval $[a, b]$ is the canonical derivative-free problem.

Golden section search

Compare two interior points $x_1 < x_2$, discard the worse end, and shrink the interval by the golden ratio $\varphi = (1 + \sqrt{5})/2$. Convergence is linear (the interval scales by $1/\varphi \approx 0.618$ each step), and the information-per-evaluation is provably optimal under unimodality. The required number of evaluations is $O(\log(1/\varepsilon))$.

Parabolic interpolation

Fit a parabola through $(x_1, f_1), (x_2, f_2), (x_3, f_3)$ and use its vertex as the next trial. On smooth functions this attains superlinear convergence (order $\approx 1.324$), but it can fail when the parabola is ill-defined.

Brent's method

Brent (1973) hybridises parabolic interpolation with golden section.

Accept parabolic interpolation when the vertex lies in the interval, the step is at most half the previous interval width, and the iterates remain bounded.
Otherwise fall back to one golden-section step.

On smooth $f$ Brent's method needs 5–10x fewer evaluations than golden section, and on non-smooth $f$ it still delivers the worst-case linear convergence. brent_minimize is the standard 1D choice in sangi.

Line search (Armijo, Wolfe)

Multidimensional methods walk along a descent direction $p_k$ from the current point $x_k$: $\alpha_k = \arg\min_{\alpha > 0} \phi(\alpha)$, $\phi(\alpha) = f(x_k + \alpha p_k)$. Exact minimisation is wasteful; inexact line search picks an $\alpha$ that satisfies acceptance conditions.

Armijo (sufficient decrease)

$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \, \nabla f(x_k)^T p_k, \qquad c_1 \in (0, 1).$$

The function must decrease by at least a fraction of the linear model. Typical $c_1 = 10^{-4}$. Backtracking halves (more generally scales by $\rho \in (0, 1)$) an initial $\alpha$ until Armijo holds.

Wolfe

Adds a curvature condition on top of Armijo:

$$\nabla f(x_k + \alpha p_k)^T p_k \geq c_2 \, \nabla f(x_k)^T p_k, \qquad c_2 \in (c_1, 1).$$

Typical $c_2 = 0.9$ for Newton/BFGS, $c_2 = 0.1$ for CG. The condition prevents excessively small steps. BFGS positive-definiteness proofs require the (strong) Wolfe condition.

sangi uses Armijo backtracking through the LineSearchParams struct, shared by all gradient-based methods.

Steepest descent

Step along $p_k = -\nabla f(x_k)$: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$ with $\alpha_k$ from line search.

Convergence is linear; for a quadratic with condition number $\kappa$ the rate is $\bigl(\frac{\kappa - 1}{\kappa + 1}\bigr)^2$, so ill conditioning causes severe zigzagging. This is the deterministic full-gradient method, distinct from SGD or its momentum variants in deep learning.

sangi steepest_descent is provided as a baseline; production code should reach for BFGS, L-BFGS, or CG.

Nonlinear conjugate gradient (FR, PR+)

Linear CG (Hestenes-Stiefel 1952) solves symmetric positive definite systems $Ax = b$ in at most $n$ steps. Fletcher-Reeves (1964) and Polak-Ribière (1969) extended the idea to nonlinear optimisation by enforcing conjugacy with the previous search direction:

$$p_{k+1} = -g_{k+1} + \beta_{k+1} \, p_k, \qquad g_k = \nabla f(x_k).$$

Variant	$\beta_{k+1}$
Fletcher-Reeves	$\beta_{k+1}^{FR} = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$
Polak-Ribière	$\beta_{k+1}^{PR} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}$
Polak-Ribière+	$\beta_{k+1}^{PR+} = \max(0, \beta_{k+1}^{PR})$

PR+ is typically faster than FR in practice but needs a strong-Wolfe line search to ensure descent. The sangi conjugateGradientMinimize defaults to PR+ and resets via $p \leftarrow -g$ every $n$ steps.

$O(n)$ memory makes CG attractive for very large problems, but on ill-conditioned objectives L-BFGS usually wins.

Newton and modified Newton

Newton's method minimises the second-order Taylor model explicitly:

$$f(x_k + p) \approx f(x_k) + g_k^T p + \tfrac{1}{2} p^T H_k p \;\Longrightarrow\; p_k^{\text{Newton}} = -H_k^{-1} g_k.$$

If $H_k$ is symmetric positive definite (SPD), $p_k^{\text{Newton}}$ is a descent direction and convergence is quadratic.

When the Hessian is not SPD

Non-convex objectives and saddle neighbourhoods can produce indefinite or singular $H_k$. Remedies:

Modified Cholesky (Gill-Murray): bump up diagonals during factorisation to enforce SPD.
Levenberg-Marquardt damping: solve $(H + \lambda I) p = -g$.
Trust region: trust the quadratic model only within a radius (next section).
Truncated Newton (Newton-CG): solve $H p = -g$ inexactly with CG; terminate on negative curvature.

Newton-CG

sangi newton_cg_minimize is a truncated Newton method that solves $H p = -g$ via inner CG. It never stores $H$ explicitly; the user provides the Hessian-vector product $H(x) v$. Effective when $n$ is large and $H$ is sparse or structured.

BFGS and SR1

Quasi-Newton methods build an approximate Hessian (or its inverse) from gradient differences $y_k = g_{k+1} - g_k$ and step differences $s_k = x_{k+1} - x_k$ without computing $\nabla^2 f$ directly.

BFGS update (Broyden-Fletcher-Goldfarb-Shanno, 1970)

Rank-2 update of $B_k \approx H_k^{-1}$:

$$B_{k+1} = \Bigl(I - \frac{s_k y_k^T}{y_k^T s_k}\Bigr) B_k \Bigl(I - \frac{y_k s_k^T}{y_k^T s_k}\Bigr) + \frac{s_k s_k^T}{y_k^T s_k}.$$

Properties:

If $B_0$ is SPD and Wolfe holds, every $B_{k+1}$ stays SPD.
Superlinear convergence (Powell 1976).
$O(n^2)$ memory and $O(n^2)$ update cost.

BFGS is the de facto standard for medium-scale problems. sangi bfgs_minimize implements it.

SR1 (Symmetric Rank-1)

Update $B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$. The update is symmetric but loses positive definiteness; the approximation can be more accurate on non-convex problems and is usually paired with a trust-region framework.

L-BFGS — limited memory

Nocedal (1980) replaced the dense $B_k$ of BFGS with $m$ stored pairs $(s_i, y_i)$ and computed $B_k g_k$ with a two-loop recursion from a seed $B_0$ (usually $\gamma_k I$).

Two-loop recursion

q = g_k
for i = k-1 .. k-m:
    α_i = (s_i^T q) / (y_i^T s_i)
    q -= α_i * y_i
r = γ_k * q                  # γ_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1})
for i = k-m .. k-1:
    β = (y_i^T r) / (y_i^T s_i)
    r += (α_i - β) * s_i
return r                     # = -B_k g_k

Typical $5 \leq m \leq 20$. Memory is $O(mn)$ and one iteration costs $O(mn)$, so problems with $n = 10^6$ are routine. L-BFGS dominates large-scale machine learning, physics simulation, and tomography.

sangi lbfgsMinimize implements the two-loop recursion; memory_size (default 10) controls $m$.

Trust region (LM, dogleg, Steihaug-CG)

Line search picks a direction, then a step. Trust region picks a step radius and then the best step inside it. Each iteration solves the sub-problem

$$\min_{p} \;m_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \quad \text{subject to} \quad \|p\| \leq \Delta_k.$$

The radius $\Delta_k$ is updated based on the ratio $\rho_k = \dfrac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$:

$\rho_k > 0.75 \Rightarrow \Delta_{k+1} \leftarrow 2 \Delta_k$ (expand).
$\rho_k < 0.25 \Rightarrow \Delta_{k+1} \leftarrow \Delta_k / 4$ (shrink).
$\rho_k > 0.1 \Rightarrow x_{k+1} \leftarrow x_k + p_k$ (accept); otherwise reject.

Approximate sub-problem solvers

Cauchy point: walk along the steepest descent direction until the radius. Always defined, slow convergence.
Dogleg: line-segment between the Cauchy point and the Newton step, take the intersection with the radius. Powell (1970). Requires SPD $B_k$.
Steihaug-CG: run CG on $B_k p = -g_k$, terminate at the trust boundary or when negative curvature is detected. Works for indefinite $B_k$ and large sparse Hessians.
Levenberg-Marquardt: $(B_k + \lambda I) p = -g$, the standard for nonlinear least squares. See Optimization-constrained-lp.

Trust-region methods stay stable for indefinite Hessians, which is why non-convex and ill-conditioned problems prefer them. sangi applies a Levenberg-Marquardt-style trust region inside gaussNewtonMinimize and leastSquaresFit.

Algorithm selection guide

Situation	Recommended
1D, smooth	`brent_minimize`
1D, non-smooth	`golden_section_search`
$n \leq 100$, gradient available	`bfgs_minimize`
$n \leq 10^3$, Hessian-vector available	`newton_cg_minimize`
$n \geq 10^3$, gradient only, ill conditioned	`lbfgsMinimize`
$n \geq 10^3$, tight memory	`conjugateGradientMinimize`
Nonlinear least squares $\sum r_i(x)^2$	`gaussNewtonMinimize` / `leastSquaresFit`
Costly or noisy gradients	`nelderMeadMinimize` etc. (derivative-free)

References

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization, 2nd ed. Springer.
Fletcher, R. (1987). Practical Methods of Optimization, 2nd ed. Wiley.
Brent, R. P. (1973). Algorithms for Minimization without Derivatives. Prentice-Hall.
Liu, D. C. and Nocedal, J. (1989). "On the limited memory BFGS method for large scale optimization". Math. Programming, 45, 503–528.
Steihaug, T. (1983). "The Conjugate Gradient Method and Trust Regions in Large Scale Optimization". SIAM J. Numer. Anal., 20, 626–637.