Derivative-Free Optimization and Metaheuristics

Overview

Derivative-Free Optimization (DFO) solves $\min f(x)$ using only function evaluations. It applies when gradients are unavailable, analytically intractable, expensive to evaluate, or noisy. This article surveys deterministic direct search (Nelder-Mead, pattern search, DIRECT, BOBYQA) together with stochastic metaheuristics (SA, GA, PSO, DE, CMA-ES).

sangi exposes these algorithms in optimization_nd.hpp. Metaheuristics with box constraints $\mathbf{x} \in [l, u]$ use the bounded signature.

Nelder-Mead simplex

Nelder and Mead (1965) iteratively reshape a simplex $\{x_0, x_1, \dots, x_n\}$ of $n + 1$ vertices through four geometric moves. Vertices are sorted by function value ($f(x_0) \leq \cdots \leq f(x_n)$), and the worst vertex $x_n$ is improved against the centroid $\bar{x} = \frac{1}{n} \sum_{i=0}^{n-1} x_i$.

MoveFormulaTrigger
Reflection$x_r = \bar{x} + \alpha (\bar{x} - x_n)$$\alpha = 1$
Expansion$x_e = \bar{x} + \gamma (x_r - \bar{x})$$\gamma = 2$, if $x_r$ is best
Outer contraction$x_c = \bar{x} + \rho (x_r - \bar{x})$$\rho = 0.5$, if $x_r$ is mid-ranked
Inner contraction$x_c = \bar{x} - \rho (x_r - \bar{x})$$\rho = 0.5$, if $x_r$ is worst
Shrink$x_i \leftarrow x_0 + \sigma (x_i - x_0)$$\sigma = 0.5$, all above fail

Stopping criteria

Use a joint test: $\max_i f(x_i) - f(x_0) < \varepsilon_f$ and $\max_i \|x_i - x_0\| < \varepsilon_x$. Neither alone reliably detects stalling.

Caveats

  • Excellent at $n \leq 10$, deteriorates beyond $n \sim 20$.
  • Repeated shrinks on ill-conditioned problems cause stagnation; restart the simplex if necessary.
  • Convergence is not guaranteed in general (McKinnon 1998 gives a counter-example).

sangi nelderMeadMinimize exposes the coefficients via NelderMeadOptions.

DIRECT and BOBYQA

DIRECT (DIvide a RECTangle)

Jones, Perttunen, and Stuckle (1993) split the search box into hyper-rectangles, sample the centre of each, and recursively subdivide the "potentially optimal" rectangles. No Lipschitz constant is required. DIRECT is one of the rare deterministic global optimisers, but the curse of dimensionality limits it to $n \leq 10$ in practice.

BOBYQA (Bound Optimization BY Quadratic Approximation)

Powell (2009) builds a quadratic interpolation model on $\binom{n+1}{2} + n + 1$ points and minimises it with a trust-region method. It handles box constraints and is the default choice for expensive engineering design problems with no gradient.

Related Powell-family solvers include COBYLA (linear interpolation with constraints), NEWUOA (unconstrained BOBYQA), and UOBYQA (the original quadratic-interpolation version). BOBYQA and COBYLA are not yet shipped in the current sangi release; they are on the roadmap.

Where Powell-family methods still shine

Even with Bayesian optimisation (Gaussian-process surrogate plus acquisition function) now dominant in expensive black-box settings, Powell-family quadratic interpolation remains strong on problems with no gradient, expensive evaluations, and $n \leq 30$.

Simulated annealing (SA)

Kirkpatrick, Gelatt, and Vecchi (1983) introduced SA by analogy with the annealing of metals. Worse-than-current proposals are accepted with a Metropolis probability that depends on temperature $T_k$:

$$P(\text{accept}) = \begin{cases} 1 & f(x') \leq f(x) \\ \exp\!\bigl(-(f(x') - f(x))/T_k\bigr) & f(x') > f(x) \end{cases}.$$

Cooling schedules

  • Exponential: $T_{k+1} = r \cdot T_k$ with $r \in [0.9, 0.999]$. Common in practice.
  • Logarithmic: $T_k = T_0 / \log(k + 2)$. Provably converges to the global optimum, but impractically slowly.
  • Adaptive: adjust $r$ based on the acceptance rate.

sangi simulatedAnnealingMinimize uses exponential cooling (default $r = 0.995$) with Gaussian neighbours whose standard deviation is a fraction (neighbor_scale = 0.1) of the box width.

Genetic algorithm (GA)

Holland (1975) introduced evolutionary computation through populations evolved via selection, crossover, mutation, and elitism.

  • Selection: tournament selection (pick the best of $k$ random members) is the standard.
  • Crossover: BLX-$\alpha$ (Blend Crossover) and SBX (Simulated Binary Crossover) for continuous variables. BLX-$\alpha$ samples uniformly from $[p_1 - \alpha d, p_2 + \alpha d]$ where $p_1 \leq p_2$ and $d = p_2 - p_1$.
  • Mutation: Gaussian $x' = x + \mathcal{N}(0, \sigma^2)$ or polynomial mutation.
  • Elitism: copy the top $k$ individuals into the next generation to prevent regression.

sangi geneticAlgorithmMinimize uses tournament selection, BLX-$0.5$ crossover, Gaussian mutation, and elitism (default 2). GA is general purpose but, on continuous problems, usually trails CMA-ES and DE; it shines when the encoding has discrete or combinatorial structure.

Particle swarm optimisation (PSO)

Kennedy and Eberhart (1995) update each particle's position $x_i$ and velocity $v_i$ from its personal best $p_i$ and the swarm-wide best $g$:

$$v_i \leftarrow w \, v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i), \qquad x_i \leftarrow x_i + v_i.$$

Typical $w = 0.7$, $c_1, c_2 \in [1.5, 2.0]$, $r_1, r_2 \sim \mathcal{U}[0, 1]$. Easy to implement and embarrassingly parallel; the tendency to get stuck near a local optimum is usually mitigated by hybridisation.

sangi does not ship a standalone PSO in the current release; DE, CMA-ES, and jSO cover overlapping use cases.

Differential evolution (DE, jSO)

DE (Storn-Price 1997)

DE/rand/1/bin builds trial vectors with a simple difference mutation:

$$v_i = x_{r_0} + F (x_{r_1} - x_{r_2}), \quad r_0, r_1, r_2 \text{ pairwise distinct}.$$

Binomial crossover yields $u_i$, accepted greedily ($x_i \leftarrow u_i$) when $f(u_i) < f(x_i)$. Typical $F = 0.8$, $CR = 0.9$. Low overhead and surprisingly competitive on continuous global problems.

sangi differentialEvolutionMinimize implements DE/rand/1/bin.

L-SHADE and jSO

SHADE (Tanabe and Fukunaga 2013) keeps a history of successful $(F, CR)$ values and resamples adaptively. L-SHADE adds linear population size reduction (CEC 2014 winner). jSO (Brest et al. 2017) layers a different $F/CR$ initialisation and the current-to-pbest/1 mutation on top of L-SHADE, finishing near the top of CEC 2017.

sangi jsoMinimize implements jSO. It is one of the most reliable metaheuristics shipped, alongside CMA-ES.

CMA-ES

Hansen and Ostermeier (2001) sample $\lambda$ points from $\mathcal{N}(m_k, \sigma_k^2 C_k)$, select the top $\mu_w$, and update $m, \sigma, C$ from the weighted offspring.

Update outline

  1. Sample $\lambda$ candidates: $x_i \sim \mathcal{N}(m_k, \sigma_k^2 C_k)$.
  2. Sort by function value; compute the weighted mean $m_{k+1} = \sum_{i=1}^{\mu_w} w_i x_{i:\lambda}$.
  3. Update evolution paths $p_\sigma$ and $p_c$ with cumulative information.
  4. Update $C_{k+1}$ via combined rank-1 and rank-$\mu_w$ updates.
  5. Update $\sigma_{k+1}$ via cumulative step-size adaptation (CSA) from $\|p_\sigma\|$.

$C$ approaches the inverse Hessian on quadratic objectives, so CMA-ES behaves quasi-Newton-like without derivatives. Strong on ill-conditioned, non-separable problems. Default $\lambda = 4 + \lfloor 3 \ln n \rfloor$ scales gracefully up to a few hundred dimensions.

sangi cmaesMinimize implements the standard $(\mu/\mu_w, \lambda)$-CMA-ES with condition_limit = 10^{14} to keep $C$ in check.

Noisy and non-smooth objectives

When evaluations are noisy (Monte Carlo simulations, hardware measurements, stochastic simulators):

  • Re-evaluate: average several evaluations of the same $x$. Sample count typically grows with the generation.
  • Noise-robust algorithms: population-based methods like CMA-ES average noise implicitly. Nelder-Mead is fragile under noise.
  • Loose tolerance: set the stopping tolerance above the noise floor.

For non-smooth objectives (piecewise-linear, discontinuous, $\max/\min$):

  • MADS / GPS converge to Clarke stationary points under modest assumptions.
  • Nelder-Mead often works but lacks guarantees.
  • Metaheuristics (SA, GA, DE, CMA-ES) run without continuity, with only probabilistic guarantees.

When the non-smoothness is known (e.g. $\max(0, x)$), smoothing into $\log(1 + e^x)$ and switching back to a gradient method is often the fastest path.

Selection guide

SituationRecommended (sangi)
$n \leq 10$, smooth, cheap evaluationsnelderMeadMinimize
$n \leq 30$, cheap evaluationshookeJeevesMinimize
$n \leq $ a few hundred, ill conditionedcmaesMinimize
$n \leq $ a few hundred, benchmark-strength solverjsoMinimize
General continuous global searchdifferentialEvolutionMinimize
Discrete / combinatorial structuregeneticAlgorithmMinimize
Need to escape local minimasimulatedAnnealingMinimize
Very expensive evaluations (> seconds each)Bayesian optimisation (external) or BOBYQA (planned)

References

  • Conn, A. R., Scheinberg, K., and Vicente, L. N. (2009). Introduction to Derivative-Free Optimization. SIAM.
  • Nelder, J. A. and Mead, R. (1965). "A Simplex Method for Function Minimization". Computer Journal, 7, 308–313.
  • Hansen, N. and Ostermeier, A. (2001). "Completely Derandomized Self-Adaptation in Evolution Strategies". Evolutionary Computation, 9, 159–195.
  • Storn, R. and Price, K. (1997). "Differential Evolution — A Simple and Efficient Heuristic for global Optimization over Continuous Spaces". J. Global Optimization, 11, 341–359.
  • Brest, J., Maučec, M. S., and Bošković, B. (2017). "Single objective real-parameter optimization: Algorithm jSO". CEC 2017.
  • Audet, C. and Dennis, J. E. (2006). "Mesh Adaptive Direct Search Algorithms for Constrained Optimization". SIAM J. Optim., 17, 188–217.