When should I use derivative-free optimisation?

Whenever the gradient is unavailable analytically, numerical differentiation is too expensive, the objective is noisy, or the function is non-smooth. Concrete examples include black-box simulator tuning, hardware calibration, and machine-learning hyperparameter search.

Why does Nelder-Mead lose performance in high dimensions?

The method evolves an n + 1 vertex simplex, so the information acquired per function evaluation drops as n grows. Shrinking tends to bias one direction and stalls on ill-conditioned problems. In practice it becomes unreliable beyond n = 20, while remaining excellent at low dimension.

What makes CMA-ES so robust?

It adapts the covariance matrix C of its sampling distribution to the local structure of the objective, effectively imitating a quasi-Newton method without gradients. This makes it robust on ill-conditioned, non-separable problems and the most reliable metaheuristic at medium dimensions (n ≤ a few hundred).

How do DIRECT and BOBYQA differ?

DIRECT (Jones 1993) is a deterministic global optimiser that recursively partitions hyper-rectangles without needing a Lipschitz constant. BOBYQA (Powell 2009) is a local optimiser that builds a quadratic interpolation model and runs trust-region steps. They complement each other: DIRECT for global, BOBYQA for fine local refinement.

What is jSO an improvement over?

It refines L-SHADE, the history-based variant of differential evolution. jSO adds current-to-pbest/1 mutation, success-history adaptation of F and CR, linear population size reduction, and jSO-specific F/CR initialisation. It was a top performer in the CEC 2017 benchmarks.

Derivative-Free Optimization and Metaheuristics

Overview

Derivative-Free Optimization (DFO) solves $\min f(x)$ using only function evaluations. It applies when gradients are unavailable, analytically intractable, expensive to evaluate, or noisy. This article surveys deterministic direct search (Nelder-Mead, pattern search, DIRECT, BOBYQA) together with stochastic metaheuristics (SA, GA, PSO, DE, CMA-ES).

sangi exposes these algorithms in optimization_nd.hpp. Metaheuristics with box constraints $\mathbf{x} \in [l, u]$ use the bounded signature.

Nelder-Mead simplex

Nelder and Mead (1965) iteratively reshape a simplex $\{x_0, x_1, \dots, x_n\}$ of $n + 1$ vertices through four geometric moves. Vertices are sorted by function value ($f(x_0) \leq \cdots \leq f(x_n)$), and the worst vertex $x_n$ is improved against the centroid $\bar{x} = \frac{1}{n} \sum_{i=0}^{n-1} x_i$.

Move	Formula	Trigger
Reflection	$x_r = \bar{x} + \alpha (\bar{x} - x_n)$	$\alpha = 1$
Expansion	$x_e = \bar{x} + \gamma (x_r - \bar{x})$	$\gamma = 2$, if $x_r$ is best
Outer contraction	$x_c = \bar{x} + \rho (x_r - \bar{x})$	$\rho = 0.5$, if $x_r$ is mid-ranked
Inner contraction	$x_c = \bar{x} - \rho (x_r - \bar{x})$	$\rho = 0.5$, if $x_r$ is worst
Shrink	$x_i \leftarrow x_0 + \sigma (x_i - x_0)$	$\sigma = 0.5$, all above fail

Stopping criteria

Use a joint test: $\max_i f(x_i) - f(x_0) < \varepsilon_f$ and $\max_i \|x_i - x_0\| < \varepsilon_x$. Neither alone reliably detects stalling.

Caveats

Excellent at $n \leq 10$, deteriorates beyond $n \sim 20$.
Repeated shrinks on ill-conditioned problems cause stagnation; restart the simplex if necessary.
Convergence is not guaranteed in general (McKinnon 1998 gives a counter-example).

sangi nelderMeadMinimize exposes the coefficients via NelderMeadOptions.

Pattern search (Hooke-Jeeves, GPS, MADS)

Hooke-Jeeves

Hooke and Jeeves (1961) introduced the original pattern search, alternating two phases:

Exploratory move: probe each coordinate axis $e_i$ at $\pm \Delta$ and accept improvements.
Pattern move: take another step along the direction that improved.

If no $\pm \Delta$ improves, shrink $\Delta \leftarrow \Delta / 2$ (or $\Delta / \sqrt{2}$) and check convergence. sangi hookeJeevesMinimize shrinks by $\sqrt{2}$ and stops when $\Delta x < $ min_step.

GPS (Generalized Pattern Search)

Torczon (1997) generalised Hooke-Jeeves to arbitrary positive bases $D$ (every vector is a non-negative combination of $D$). Under smoothness GPS provably converges to a local minimum.

MADS (Mesh Adaptive Direct Search)

Audet and Dennis (2006) let the direction set change adaptively with the mesh size, recovering convergence to Clarke stationary points on non-smooth objectives. The NOMAD solver implements MADS for black-box engineering optimisation.

DIRECT and BOBYQA

DIRECT (DIvide a RECTangle)

Jones, Perttunen, and Stuckle (1993) split the search box into hyper-rectangles, sample the centre of each, and recursively subdivide the "potentially optimal" rectangles. No Lipschitz constant is required. DIRECT is one of the rare deterministic global optimisers, but the curse of dimensionality limits it to $n \leq 10$ in practice.

BOBYQA (Bound Optimization BY Quadratic Approximation)

Powell (2009) builds a quadratic interpolation model on $\binom{n+1}{2} + n + 1$ points and minimises it with a trust-region method. It handles box constraints and is the default choice for expensive engineering design problems with no gradient.

Related Powell-family solvers include COBYLA (linear interpolation with constraints), NEWUOA (unconstrained BOBYQA), and UOBYQA (the original quadratic-interpolation version). BOBYQA and COBYLA are not yet shipped in the current sangi release; they are on the roadmap.

Where Powell-family methods still shine

Even with Bayesian optimisation (Gaussian-process surrogate plus acquisition function) now dominant in expensive black-box settings, Powell-family quadratic interpolation remains strong on problems with no gradient, expensive evaluations, and $n \leq 30$.

Simulated annealing (SA)

Kirkpatrick, Gelatt, and Vecchi (1983) introduced SA by analogy with the annealing of metals. Worse-than-current proposals are accepted with a Metropolis probability that depends on temperature $T_k$:

$$P(\text{accept}) = \begin{cases} 1 & f(x') \leq f(x) \\ \exp\!\bigl(-(f(x') - f(x))/T_k\bigr) & f(x') > f(x) \end{cases}.$$

Cooling schedules

Exponential: $T_{k+1} = r \cdot T_k$ with $r \in [0.9, 0.999]$. Common in practice.
Logarithmic: $T_k = T_0 / \log(k + 2)$. Provably converges to the global optimum, but impractically slowly.
Adaptive: adjust $r$ based on the acceptance rate.

sangi simulatedAnnealingMinimize uses exponential cooling (default $r = 0.995$) with Gaussian neighbours whose standard deviation is a fraction (neighbor_scale = 0.1) of the box width.

Genetic algorithm (GA)

Holland (1975) introduced evolutionary computation through populations evolved via selection, crossover, mutation, and elitism.

Selection: tournament selection (pick the best of $k$ random members) is the standard.
Crossover: BLX-$\alpha$ (Blend Crossover) and SBX (Simulated Binary Crossover) for continuous variables. BLX-$\alpha$ samples uniformly from $[p_1 - \alpha d, p_2 + \alpha d]$ where $p_1 \leq p_2$ and $d = p_2 - p_1$.
Mutation: Gaussian $x' = x + \mathcal{N}(0, \sigma^2)$ or polynomial mutation.
Elitism: copy the top $k$ individuals into the next generation to prevent regression.

sangi geneticAlgorithmMinimize uses tournament selection, BLX-$0.5$ crossover, Gaussian mutation, and elitism (default 2). GA is general purpose but, on continuous problems, usually trails CMA-ES and DE; it shines when the encoding has discrete or combinatorial structure.

Particle swarm optimisation (PSO)

Kennedy and Eberhart (1995) update each particle's position $x_i$ and velocity $v_i$ from its personal best $p_i$ and the swarm-wide best $g$:

$$v_i \leftarrow w \, v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i), \qquad x_i \leftarrow x_i + v_i.$$

Typical $w = 0.7$, $c_1, c_2 \in [1.5, 2.0]$, $r_1, r_2 \sim \mathcal{U}[0, 1]$. Easy to implement and embarrassingly parallel; the tendency to get stuck near a local optimum is usually mitigated by hybridisation.

sangi does not ship a standalone PSO in the current release; DE, CMA-ES, and jSO cover overlapping use cases.

Differential evolution (DE, jSO)

DE (Storn-Price 1997)

DE/rand/1/bin builds trial vectors with a simple difference mutation:

$$v_i = x_{r_0} + F (x_{r_1} - x_{r_2}), \quad r_0, r_1, r_2 \text{ pairwise distinct}.$$

Binomial crossover yields $u_i$, accepted greedily ($x_i \leftarrow u_i$) when $f(u_i) < f(x_i)$. Typical $F = 0.8$, $CR = 0.9$. Low overhead and surprisingly competitive on continuous global problems.

sangi differentialEvolutionMinimize implements DE/rand/1/bin.

L-SHADE and jSO

SHADE (Tanabe and Fukunaga 2013) keeps a history of successful $(F, CR)$ values and resamples adaptively. L-SHADE adds linear population size reduction (CEC 2014 winner). jSO (Brest et al. 2017) layers a different $F/CR$ initialisation and the current-to-pbest/1 mutation on top of L-SHADE, finishing near the top of CEC 2017.

sangi jsoMinimize implements jSO. It is one of the most reliable metaheuristics shipped, alongside CMA-ES.

CMA-ES

Hansen and Ostermeier (2001) sample $\lambda$ points from $\mathcal{N}(m_k, \sigma_k^2 C_k)$, select the top $\mu_w$, and update $m, \sigma, C$ from the weighted offspring.

Update outline

Sample $\lambda$ candidates: $x_i \sim \mathcal{N}(m_k, \sigma_k^2 C_k)$.
Sort by function value; compute the weighted mean $m_{k+1} = \sum_{i=1}^{\mu_w} w_i x_{i:\lambda}$.
Update evolution paths $p_\sigma$ and $p_c$ with cumulative information.
Update $C_{k+1}$ via combined rank-1 and rank-$\mu_w$ updates.
Update $\sigma_{k+1}$ via cumulative step-size adaptation (CSA) from $\|p_\sigma\|$.

$C$ approaches the inverse Hessian on quadratic objectives, so CMA-ES behaves quasi-Newton-like without derivatives. Strong on ill-conditioned, non-separable problems. Default $\lambda = 4 + \lfloor 3 \ln n \rfloor$ scales gracefully up to a few hundred dimensions.

sangi cmaesMinimize implements the standard $(\mu/\mu_w, \lambda)$-CMA-ES with condition_limit = 10^{14} to keep $C$ in check.

Noisy and non-smooth objectives

When evaluations are noisy (Monte Carlo simulations, hardware measurements, stochastic simulators):

Re-evaluate: average several evaluations of the same $x$. Sample count typically grows with the generation.
Noise-robust algorithms: population-based methods like CMA-ES average noise implicitly. Nelder-Mead is fragile under noise.
Loose tolerance: set the stopping tolerance above the noise floor.

For non-smooth objectives (piecewise-linear, discontinuous, $\max/\min$):

MADS / GPS converge to Clarke stationary points under modest assumptions.
Nelder-Mead often works but lacks guarantees.
Metaheuristics (SA, GA, DE, CMA-ES) run without continuity, with only probabilistic guarantees.

When the non-smoothness is known (e.g. $\max(0, x)$), smoothing into $\log(1 + e^x)$ and switching back to a gradient method is often the fastest path.

Selection guide

Situation	Recommended (sangi)
$n \leq 10$, smooth, cheap evaluations	`nelderMeadMinimize`
$n \leq 30$, cheap evaluations	`hookeJeevesMinimize`
$n \leq $ a few hundred, ill conditioned	`cmaesMinimize`
$n \leq $ a few hundred, benchmark-strength solver	`jsoMinimize`
General continuous global search	`differentialEvolutionMinimize`
Discrete / combinatorial structure	`geneticAlgorithmMinimize`
Need to escape local minima	`simulatedAnnealingMinimize`
Very expensive evaluations (> seconds each)	Bayesian optimisation (external) or BOBYQA (planned)

References

Conn, A. R., Scheinberg, K., and Vicente, L. N. (2009). Introduction to Derivative-Free Optimization. SIAM.
Nelder, J. A. and Mead, R. (1965). "A Simplex Method for Function Minimization". Computer Journal, 7, 308–313.
Hansen, N. and Ostermeier, A. (2001). "Completely Derandomized Self-Adaptation in Evolution Strategies". Evolutionary Computation, 9, 159–195.
Storn, R. and Price, K. (1997). "Differential Evolution — A Simple and Efficient Heuristic for global Optimization over Continuous Spaces". J. Global Optimization, 11, 341–359.
Brest, J., Maučec, M. S., and Bošković, B. (2017). "Single objective real-parameter optimization: Algorithm jSO". CEC 2017.
Audet, C. and Dennis, J. E. (2006). "Mesh Adaptive Direct Search Algorithms for Constrained Optimization". SIAM J. Optim., 17, 188–217.