Derivative-Free Optimization and Metaheuristics
Overview
Derivative-Free Optimization (DFO) solves $\min f(x)$ using only function evaluations. It applies when gradients are unavailable, analytically intractable, expensive to evaluate, or noisy. This article surveys deterministic direct search (Nelder-Mead, pattern search, DIRECT, BOBYQA) together with stochastic metaheuristics (SA, GA, PSO, DE, CMA-ES).
sangi exposes these algorithms in optimization_nd.hpp.
Metaheuristics with box constraints $\mathbf{x} \in [l, u]$ use the bounded signature.
Nelder-Mead simplex
Nelder and Mead (1965) iteratively reshape a simplex $\{x_0, x_1, \dots, x_n\}$ of $n + 1$ vertices through four geometric moves. Vertices are sorted by function value ($f(x_0) \leq \cdots \leq f(x_n)$), and the worst vertex $x_n$ is improved against the centroid $\bar{x} = \frac{1}{n} \sum_{i=0}^{n-1} x_i$.
| Move | Formula | Trigger |
|---|---|---|
| Reflection | $x_r = \bar{x} + \alpha (\bar{x} - x_n)$ | $\alpha = 1$ |
| Expansion | $x_e = \bar{x} + \gamma (x_r - \bar{x})$ | $\gamma = 2$, if $x_r$ is best |
| Outer contraction | $x_c = \bar{x} + \rho (x_r - \bar{x})$ | $\rho = 0.5$, if $x_r$ is mid-ranked |
| Inner contraction | $x_c = \bar{x} - \rho (x_r - \bar{x})$ | $\rho = 0.5$, if $x_r$ is worst |
| Shrink | $x_i \leftarrow x_0 + \sigma (x_i - x_0)$ | $\sigma = 0.5$, all above fail |
Stopping criteria
Use a joint test: $\max_i f(x_i) - f(x_0) < \varepsilon_f$ and $\max_i \|x_i - x_0\| < \varepsilon_x$. Neither alone reliably detects stalling.
Caveats
- Excellent at $n \leq 10$, deteriorates beyond $n \sim 20$.
- Repeated shrinks on ill-conditioned problems cause stagnation; restart the simplex if necessary.
- Convergence is not guaranteed in general (McKinnon 1998 gives a counter-example).
sangi nelderMeadMinimize exposes the coefficients via
NelderMeadOptions.
See also: Nelder-Mead method
Pattern search (Hooke-Jeeves, GPS, MADS)
Hooke-Jeeves
Hooke and Jeeves (1961) introduced the original pattern search, alternating two phases:
- Exploratory move: probe each coordinate axis $e_i$ at $\pm \Delta$ and accept improvements.
- Pattern move: take another step along the direction that improved.
If no $\pm \Delta$ improves, shrink $\Delta \leftarrow \Delta / 2$ (or $\Delta / \sqrt{2}$) and check convergence.
sangi hookeJeevesMinimize shrinks by $\sqrt{2}$ and stops when
$\Delta x < $ min_step.
GPS (Generalized Pattern Search)
Torczon (1997) generalised Hooke-Jeeves to arbitrary positive bases $D$ (every vector is a non-negative combination of $D$). Under smoothness GPS provably converges to a local minimum.
MADS (Mesh Adaptive Direct Search)
Audet and Dennis (2006) let the direction set change adaptively with the mesh size, recovering convergence to Clarke stationary points on non-smooth objectives. The NOMAD solver implements MADS for black-box engineering optimisation.
See also: Direct search methods
DIRECT and BOBYQA
DIRECT (DIvide a RECTangle)
Jones, Perttunen, and Stuckle (1993) split the search box into hyper-rectangles, sample the centre of each, and recursively subdivide the "potentially optimal" rectangles. No Lipschitz constant is required. DIRECT is one of the rare deterministic global optimisers, but the curse of dimensionality limits it to $n \leq 10$ in practice.
BOBYQA (Bound Optimization BY Quadratic Approximation)
Powell (2009) builds a quadratic interpolation model on $\binom{n+1}{2} + n + 1$ points and minimises it with a trust-region method. It handles box constraints and is the default choice for expensive engineering design problems with no gradient.
Related Powell-family solvers include COBYLA (linear interpolation with constraints), NEWUOA (unconstrained BOBYQA), and UOBYQA (the original quadratic-interpolation version). BOBYQA and COBYLA are not yet shipped in the current sangi release; they are on the roadmap.
Where Powell-family methods still shine
Even with Bayesian optimisation (Gaussian-process surrogate plus acquisition function) now dominant in expensive black-box settings, Powell-family quadratic interpolation remains strong on problems with no gradient, expensive evaluations, and $n \leq 30$.
Simulated annealing (SA)
Kirkpatrick, Gelatt, and Vecchi (1983) introduced SA by analogy with the annealing of metals. Worse-than-current proposals are accepted with a Metropolis probability that depends on temperature $T_k$:
$$P(\text{accept}) = \begin{cases} 1 & f(x') \leq f(x) \\ \exp\!\bigl(-(f(x') - f(x))/T_k\bigr) & f(x') > f(x) \end{cases}.$$
Cooling schedules
- Exponential: $T_{k+1} = r \cdot T_k$ with $r \in [0.9, 0.999]$. Common in practice.
- Logarithmic: $T_k = T_0 / \log(k + 2)$. Provably converges to the global optimum, but impractically slowly.
- Adaptive: adjust $r$ based on the acceptance rate.
sangi simulatedAnnealingMinimize uses exponential cooling
(default $r = 0.995$) with Gaussian neighbours whose standard deviation is a fraction (neighbor_scale = 0.1)
of the box width.
See also: Simulated annealing
Genetic algorithm (GA)
Holland (1975) introduced evolutionary computation through populations evolved via selection, crossover, mutation, and elitism.
- Selection: tournament selection (pick the best of $k$ random members) is the standard.
- Crossover: BLX-$\alpha$ (Blend Crossover) and SBX (Simulated Binary Crossover) for continuous variables. BLX-$\alpha$ samples uniformly from $[p_1 - \alpha d, p_2 + \alpha d]$ where $p_1 \leq p_2$ and $d = p_2 - p_1$.
- Mutation: Gaussian $x' = x + \mathcal{N}(0, \sigma^2)$ or polynomial mutation.
- Elitism: copy the top $k$ individuals into the next generation to prevent regression.
sangi geneticAlgorithmMinimize uses tournament selection,
BLX-$0.5$ crossover, Gaussian mutation, and elitism (default 2). GA is general purpose but, on continuous
problems, usually trails CMA-ES and DE; it shines when the encoding has discrete or combinatorial structure.
See also: Genetic algorithms
Particle swarm optimisation (PSO)
Kennedy and Eberhart (1995) update each particle's position $x_i$ and velocity $v_i$ from its personal best $p_i$ and the swarm-wide best $g$:
$$v_i \leftarrow w \, v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i), \qquad x_i \leftarrow x_i + v_i.$$
Typical $w = 0.7$, $c_1, c_2 \in [1.5, 2.0]$, $r_1, r_2 \sim \mathcal{U}[0, 1]$. Easy to implement and embarrassingly parallel; the tendency to get stuck near a local optimum is usually mitigated by hybridisation.
sangi does not ship a standalone PSO in the current release; DE, CMA-ES, and jSO cover overlapping use cases.
See also: Swarm intelligence (incl. PSO)
Differential evolution (DE, jSO)
DE (Storn-Price 1997)
DE/rand/1/bin builds trial vectors with a simple difference mutation:
$$v_i = x_{r_0} + F (x_{r_1} - x_{r_2}), \quad r_0, r_1, r_2 \text{ pairwise distinct}.$$
Binomial crossover yields $u_i$, accepted greedily ($x_i \leftarrow u_i$) when $f(u_i) < f(x_i)$. Typical $F = 0.8$, $CR = 0.9$. Low overhead and surprisingly competitive on continuous global problems.
sangi differentialEvolutionMinimize implements DE/rand/1/bin.
L-SHADE and jSO
SHADE (Tanabe and Fukunaga 2013) keeps a history of successful $(F, CR)$ values and resamples adaptively. L-SHADE adds linear population size reduction (CEC 2014 winner). jSO (Brest et al. 2017) layers a different $F/CR$ initialisation and the current-to-pbest/1 mutation on top of L-SHADE, finishing near the top of CEC 2017.
sangi jsoMinimize implements jSO. It is one of the most
reliable metaheuristics shipped, alongside CMA-ES.
CMA-ES
Hansen and Ostermeier (2001) sample $\lambda$ points from $\mathcal{N}(m_k, \sigma_k^2 C_k)$, select the top $\mu_w$, and update $m, \sigma, C$ from the weighted offspring.
Update outline
- Sample $\lambda$ candidates: $x_i \sim \mathcal{N}(m_k, \sigma_k^2 C_k)$.
- Sort by function value; compute the weighted mean $m_{k+1} = \sum_{i=1}^{\mu_w} w_i x_{i:\lambda}$.
- Update evolution paths $p_\sigma$ and $p_c$ with cumulative information.
- Update $C_{k+1}$ via combined rank-1 and rank-$\mu_w$ updates.
- Update $\sigma_{k+1}$ via cumulative step-size adaptation (CSA) from $\|p_\sigma\|$.
$C$ approaches the inverse Hessian on quadratic objectives, so CMA-ES behaves quasi-Newton-like without derivatives. Strong on ill-conditioned, non-separable problems. Default $\lambda = 4 + \lfloor 3 \ln n \rfloor$ scales gracefully up to a few hundred dimensions.
sangi cmaesMinimize implements the standard $(\mu/\mu_w, \lambda)$-CMA-ES
with condition_limit = 10^{14} to keep $C$ in check.
See also: Evolutionary strategies (incl. CMA-ES)
Noisy and non-smooth objectives
When evaluations are noisy (Monte Carlo simulations, hardware measurements, stochastic simulators):
- Re-evaluate: average several evaluations of the same $x$. Sample count typically grows with the generation.
- Noise-robust algorithms: population-based methods like CMA-ES average noise implicitly. Nelder-Mead is fragile under noise.
- Loose tolerance: set the stopping tolerance above the noise floor.
For non-smooth objectives (piecewise-linear, discontinuous, $\max/\min$):
- MADS / GPS converge to Clarke stationary points under modest assumptions.
- Nelder-Mead often works but lacks guarantees.
- Metaheuristics (SA, GA, DE, CMA-ES) run without continuity, with only probabilistic guarantees.
When the non-smoothness is known (e.g. $\max(0, x)$), smoothing into $\log(1 + e^x)$ and switching back to a gradient method is often the fastest path.
Selection guide
| Situation | Recommended (sangi) |
|---|---|
| $n \leq 10$, smooth, cheap evaluations | nelderMeadMinimize |
| $n \leq 30$, cheap evaluations | hookeJeevesMinimize |
| $n \leq $ a few hundred, ill conditioned | cmaesMinimize |
| $n \leq $ a few hundred, benchmark-strength solver | jsoMinimize |
| General continuous global search | differentialEvolutionMinimize |
| Discrete / combinatorial structure | geneticAlgorithmMinimize |
| Need to escape local minima | simulatedAnnealingMinimize |
| Very expensive evaluations (> seconds each) | Bayesian optimisation (external) or BOBYQA (planned) |
References
- Conn, A. R., Scheinberg, K., and Vicente, L. N. (2009). Introduction to Derivative-Free Optimization. SIAM.
- Nelder, J. A. and Mead, R. (1965). "A Simplex Method for Function Minimization". Computer Journal, 7, 308–313.
- Hansen, N. and Ostermeier, A. (2001). "Completely Derandomized Self-Adaptation in Evolution Strategies". Evolutionary Computation, 9, 159–195.
- Storn, R. and Price, K. (1997). "Differential Evolution — A Simple and Efficient Heuristic for global Optimization over Continuous Spaces". J. Global Optimization, 11, 341–359.
- Brest, J., Maučec, M. S., and Bošković, B. (2017). "Single objective real-parameter optimization: Algorithm jSO". CEC 2017.
- Audet, C. and Dennis, J. E. (2006). "Mesh Adaptive Direct Search Algorithms for Constrained Optimization". SIAM J. Optim., 17, 188–217.