Glossary of Mathematical Optimization

This glossary organizes key terms that frequently appear in optimization theory, grouped by topic. Each term includes a definition and links to related pages.

Fundamentals

Objective function
The function $f(\mathbf{x})$ to be minimized (or maximized). Also called a cost function or loss function.
Feasible region
The set of all points that satisfy every constraint. Also called the feasible set.
Global optimum
A point that minimizes the objective function over the entire feasible region.
Local optimum
A point that minimizes the objective function within some neighborhood. It is not necessarily a global optimum.
Gradient
The vector of partial derivatives of a multivariate function: $\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)^T$. It points in the direction of steepest ascent. → Introduction: Gradient
Hessian matrix
The symmetric matrix of second-order partial derivatives: $\nabla^2 f(x) = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]$. It captures curvature information and is used in second-order optimality conditions and Newton's method. → Intermediate: Newton's Method
Jacobian matrix
The $m \times n$ matrix of first-order partial derivatives of a vector-valued function $\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m$. The gradient is the transpose of the Jacobian of a scalar function.
Stationary point
A point satisfying $\nabla f(x^*) = 0$. It may be a local minimum, a local maximum, or a saddle point.
Saddle point
A stationary point that is a local minimum in some directions and a local maximum in others. It corresponds to the case where the Hessian is indefinite (having both positive and negative eigenvalues).
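The eigenvalue test can be checked numerically. As a hypothetical illustration (not from the source), the origin of $f(x, y) = x^2 - y^2$ is a stationary point whose constant Hessian $\mathrm{diag}(2, -2)$ is indefinite, so it is a saddle:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a stationary point at the origin;
# its Hessian is constant: diag(2, -2).
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

# Eigenvalues of mixed sign => indefinite Hessian => saddle point.
eigenvalues = np.linalg.eigvalsh(hessian)
is_saddle = eigenvalues.min() < 0 < eigenvalues.max()
```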

Convex Analysis

Convex set
A set in which the line segment between any two points lies within the set: $\theta x + (1-\theta)y \in C$ for all $x, y \in C$ and $0 \le \theta \le 1$. → Introduction: Convex Sets and Convex Functions
Convex function
A function on a convex domain satisfying $f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$ for all $x, y$ in the domain and $\theta \in [0, 1]$. Any local optimum is also a global optimum.
Strongly convex function
A function satisfying $f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\|y-x\|^2$ for some parameter $m > 0$. It guarantees the existence of a unique global minimizer and ensures convergence rate bounds for gradient descent.
Convex cone
A set that is convex and satisfies $\alpha x \in C$ for all $x \in C$ and $\alpha \ge 0$. The set of positive semidefinite matrices $\mathbb{S}^n_+$ is a representative example.
Epigraph
The region above the graph of a function $f$: $\text{epi}\,f = \{(x, t) \mid f(x) \le t\}$. $f$ is convex $\Leftrightarrow$ $\text{epi}\,f$ is a convex set. It is a fundamental tool in convex analysis.
Subgradient
For a non-smooth convex function $f$, a vector $g$ satisfying $f(y) \ge f(x) + g^T(y-x)$ ($\forall y$). At differentiable points, the subgradient coincides with the gradient. The subdifferential $\partial f(x)$ is the set of all subgradients.
Conjugate function (Fenchel conjugate)
$f^*(y) = \sup_x \{y^T x - f(x)\}$. A dual representation of a convex function that forms the basis of duality theory.
Proximal operator
$\text{prox}_f(x) = \arg\min_u \left\{f(u) + \frac{1}{2}\|u - x\|^2\right\}$. It is used in optimization problems with non-smooth regularization terms (ISTA, ADMM, etc.).
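For the scaled $\ell_1$ norm $f(u) = \lambda\|u\|_1$, the proximal operator has the well-known closed form of componentwise soft thresholding; a minimal NumPy sketch with illustrative values:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of f(u) = lam * ||u||_1: soft thresholding,
    which shrinks each component toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Components with |x_i| <= lam are set exactly to zero (sparsity).
shrunk = prox_l1(np.array([3.0, -0.5, 1.2]), lam=1.0)
```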
Lipschitz continuity
A function $f$ is said to be $L$-Lipschitz continuous if there exists a constant $L > 0$ such that $\|f(x) - f(y)\| \le L\|x - y\|$. Lipschitz continuity of the gradient is important in convergence rate analysis.

Optimality Conditions

First-order necessary condition
For an unconstrained problem, if $x^*$ is a local minimizer, then $\nabla f(x^*) = 0$.
Second-order sufficient condition
If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite), then $x^*$ is a strict local minimizer.
KKT conditions (Karush-Kuhn-Tucker conditions)
First-order necessary conditions for constrained optimization. They consist of four conditions:
  1. Stationarity: $\nabla f + \sum_i \lambda_i \nabla g_i + \sum_j \mu_j \nabla h_j = 0$
  2. Primal feasibility: $g_i(x) \le 0,\; h_j(x) = 0$
  3. Dual feasibility: $\lambda_i \ge 0$
  4. Complementary slackness: $\lambda_i g_i(x) = 0$
For convex problems, they are also sufficient. → Basic: KKT Conditions
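The four conditions can be verified numerically on a small example. For the hypothetical problem $\min x^2$ subject to $g(x) = 1 - x \le 0$, the minimizer is $x^* = 1$ with multiplier $\lambda^* = 2$:

```python
# KKT check for min x^2 s.t. g(x) = 1 - x <= 0 (illustrative problem).
x_star, lam = 1.0, 2.0

grad_f = 2.0 * x_star          # objective gradient at x*
grad_g = -1.0                  # constraint gradient (g(x) = 1 - x)
g = 1.0 - x_star               # constraint value at x*

stationarity = abs(grad_f + lam * grad_g) < 1e-12
primal_feasibility = g <= 0.0
dual_feasibility = lam >= 0.0
complementary_slackness = abs(lam * g) < 1e-12
kkt_holds = all([stationarity, primal_feasibility,
                 dual_feasibility, complementary_slackness])
```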
Constraint qualification
A regularity condition on the constraints under which the KKT conditions are necessary. Examples include the Linear Independence Constraint Qualification (LICQ) and Slater's condition.
Complementary slackness
$\lambda_i g_i(x^*) = 0$: at the optimal solution, if an inequality constraint is not active ($g_i < 0$), then its multiplier is $\lambda_i = 0$. Conversely, if $\lambda_i > 0$, the constraint holds with equality (active constraint).

Duality Theory

Lagrangian
$L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$. A tool for incorporating constraints into the objective function. → Introduction: Lagrange Method
Lagrange multiplier
Dual variables $\lambda_i, \mu_j$ corresponding to the constraints. They represent the rate of change of the optimal objective value when the constraint is slightly relaxed (shadow price).
Dual problem
$\max_{\lambda \ge 0,\, \mu} \min_x L(x, \lambda, \mu)$. It provides a lower bound on the primal problem. → Basic: Duality Theory
Weak duality
The optimal value of the dual problem $d^*$ is at most the optimal value of the primal problem $p^*$: $d^* \le p^*$. This always holds.
Strong duality
$d^* = p^*$ (the duality gap is zero). It holds for convex problems satisfying Slater's condition.
Duality gap
$\eta = p^* - d^* \ge 0$. The difference between the optimal values of the primal and dual problems. It is used for convergence checks and stopping criteria. Under strong duality, $\eta = 0$.
Slater's condition
For a convex problem, it holds when there exists a strictly feasible point, i.e. one at which every inequality constraint is satisfied strictly ($g_i(x) < 0$). It is the most widely used sufficient condition for strong duality.
Fenchel duality
A dual transformation using conjugate functions. The dual of $\min_x \{f(x) + g(Ax)\}$ is $\max_y \{-f^*(-A^Ty) - g^*(y)\}$. It provides the theoretical foundation for decomposition algorithms such as ADMM.

Algorithms

Gradient descent
$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$. The most fundamental iterative optimization method. → Intermediate: Gradient Descent
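A minimal sketch with a fixed step size, applied to an illustrative quadratic (all names are hypothetical):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, iters=200):
    """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# f(x) = 0.5 * ||x - 1||^2 has gradient x - 1 and minimizer (1, 1).
x_min = gradient_descent(lambda x: x - 1.0, x0=[5.0, -3.0])
```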
Stochastic gradient descent (SGD)
Uses gradient estimates from mini-batches. Suitable for large-scale data. → Intermediate: SGD
Newton's method
$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)$. Achieves local quadratic convergence but requires computation and inversion of the Hessian matrix. → Intermediate: Newton's Method
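On a quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ the Newton step is exact, reaching the minimizer $A^{-1}b$ in a single iteration; a sketch with illustrative data, solving a linear system rather than forming the inverse:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton update: x - [hess(x)]^{-1} grad(x),
    computed via a linear solve instead of an explicit inverse."""
    return x - np.linalg.solve(hess(x), grad(x))

# Quadratic f(x) = 0.5 x^T A x - b^T x: gradient A x - b, Hessian A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```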
Quasi-Newton method
A method that approximates the Hessian matrix from gradient information without computing it directly. BFGS and L-BFGS are representative examples. Achieves superlinear convergence.
Conjugate gradient method
A method that sequentially selects search directions conjugate to the previous ones. It solves an $n$-dimensional quadratic function in at most $n$ steps. → Intermediate: Conjugate Gradient
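For the quadratic $\frac{1}{2}x^T A x - b^T x$ with symmetric positive definite $A$, minimizing is equivalent to solving $Ax = b$; a minimal linear-CG sketch on illustrative data:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Linear CG for A x = b with A symmetric positive definite,
    i.e. the minimizer of 0.5 x^T A x - b^T x."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual = negative gradient
    d = r.copy()           # first search direction
    for _ in range(len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)  # makes the next d conjugate
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_cg = conjugate_gradient(A, b)  # terminates in at most 2 steps here
```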
Trust region method
A method that minimizes a quadratic model within a neighborhood (trust region) of the current point. It has more robust convergence properties than line search methods. → Intermediate: Trust Region
Nelder-Mead method
A derivative-free simplex-based direct search method. Effective for low-dimensional non-smooth problems. → Intermediate: Nelder-Mead
Direct search method
A general term for optimization methods that use only function values without gradients. Examples include pattern search and MADS. → Intermediate: Direct Search
Interior point method
A method that follows a path through the interior of the feasible region. It uses a barrier function and can solve linear programs in polynomial time. → Intermediate: Interior Point Method
Step size (learning rate)
The magnitude of the update $\alpha_k$ in an iterative method. Too large causes divergence; too small causes slow convergence. Appropriate values are chosen via the Armijo condition or Wolfe conditions.
Line search
After the search direction $d_k$ is determined, a procedure to approximately find $\alpha_k = \arg\min_\alpha f(x_k + \alpha d_k)$.
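A common inexact line search is backtracking under the Armijo condition: start from a trial step and shrink it until sufficient decrease holds. A minimal sketch on an illustrative quadratic:

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo condition holds:
    f(x + alpha d) <= f(x) + c * alpha * grad_f(x)^T d."""
    alpha = alpha0
    fx, slope = f(x), grad_f(x) @ d
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

# f(x) = ||x||^2 with the steepest-descent direction at x = (2, 0).
f = lambda x: x @ x
grad = lambda x: 2.0 * x
x = np.array([2.0, 0.0])
step = backtracking(f, grad, x, -grad(x))
```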
Convergence rate
The speed at which an iterative method approaches the optimal solution. Classified as linear ($\|x_{k+1} - x^*\| \le c\|x_k - x^*\|$), superlinear, or quadratic ($\|x_{k+1} - x^*\| \le c\|x_k - x^*\|^2$).

Constrained Optimization

Active constraint
An inequality constraint that holds with equality at the optimal solution $x^*$: $g_i(x^*) = 0$. The collection of active constraints is called the active set.
Barrier function
A function that goes to infinity at the boundary of the feasible region. The logarithmic barrier $-\sum_i \log(-g_i(x))$ is the most common example. It is used in interior point methods.
Penalty method
A method that imposes a penalty $\rho\sum_i \max(0, g_i(x))^2$ for constraint violations. As $\rho \to \infty$, it converges to the solution of the constrained problem.
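A worked sketch on the hypothetical problem $\min x^2$ subject to $1 - x \le 0$ (constrained minimizer $x^* = 1$): setting the derivative of the penalized objective to zero gives $x(\rho) = \rho/(1+\rho)$, which approaches $x^*$ as $\rho$ grows. Gradient descent on the penalized objective recovers this:

```python
def penalized_grad(x, rho):
    """Gradient of the penalized objective x^2 + rho * max(0, 1 - x)^2."""
    return 2.0 * x - 2.0 * rho * max(0.0, 1.0 - x)

def solve_penalized(rho, x=0.0, iters=2000):
    """Gradient descent on the penalized objective; the step size is
    scaled by 1/(1 + rho) because the penalty stiffens the problem."""
    alpha = 0.4 / (1.0 + rho)
    for _ in range(iters):
        x -= alpha * penalized_grad(x, rho)
    return x

x_rho = solve_penalized(1000.0)  # close to 1000/1001, i.e. nearly feasible
```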
Augmented Lagrangian method
A method that adds a quadratic penalty term to the Lagrangian; for equality constraints, $L_\rho(x,\mu) = f(x) + \sum_j \mu_j h_j(x) + \frac{\rho}{2}\sum_j h_j(x)^2$. Numerically more stable than the penalty method.
ADMM (Alternating Direction Method of Multipliers)
A variant of the augmented Lagrangian method that splits the problem into separable subproblems and solves them alternately. Widely used in large-scale distributed optimization.
Simplex method
An algorithm that solves linear programming problems by traversing the vertices of a polyhedron. Although its worst-case complexity is exponential, it is very fast in practice. → Basic: Simplex Method
Slack variable
A nonnegative variable that converts an inequality constraint $g_i(x) \le 0$ into an equality constraint $g_i(x) + s_i = 0,\; s_i \ge 0$. → Basic: Slack Variables

Special Problem Classes

Linear programming (LP)
An optimization problem where the objective function and all constraints are linear. → Basic: Linear Programming
Quadratic programming (QP)
A problem with a quadratic objective function and linear constraints. → Basic: Quadratic Programming
Semidefinite programming (SDP)
A convex optimization problem where the variable is a positive semidefinite matrix $X \succeq 0$. Used in combinatorial optimization relaxations, control theory, and quantum information.
Second-order cone programming (SOCP)
A convex problem with constraints of the form $\|Ax + b\| \le c^Tx + d$. It lies between LP and SDP. Used in robust optimization and portfolio optimization.
DC programming
A problem whose objective function is expressed as the difference of two convex functions: $f(x) = g(x) - h(x)$. The DCA (DC Algorithm) finds local solutions. This class covers a broad range of nonconvex problems.
Integer programming (IP)
A problem where some or all variables are constrained to be integers. Generally NP-hard. → Advanced: Integer Programming
Combinatorial optimization
A problem of finding an optimal solution from a discrete set of candidates. The traveling salesman problem (TSP) and the knapsack problem are representative examples. → Advanced: Combinatorial Optimization
Multi-objective optimization
A problem of simultaneously optimizing multiple objective functions. The goal is to find the set of Pareto optimal solutions. → Advanced: Multi-Objective Optimization
Robust optimization
Optimization that guarantees worst-case performance against parameter uncertainty. → Advanced: Robust Optimization
Stochastic optimization
An optimization problem where the objective function or constraints involve random variables. Includes expected value optimization and chance constraints. → Advanced: Optimization Under Uncertainty

Metaheuristics

Genetic algorithm (GA)
An optimization method inspired by biological evolution. It improves solutions through repeated selection, crossover, and mutation. → Advanced: Genetic Algorithm
Simulated annealing (SA)
A method inspired by the annealing process of metals. It stochastically explores solutions while lowering the temperature, enabling escape from local optima. → Advanced: Simulated Annealing
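A minimal sketch of the accept/reject rule (Metropolis criterion) with geometric cooling, applied to a hypothetical double-well function; all parameter values are illustrative:

```python
import math, random

def simulated_annealing(f, x0, t0=1.0, cooling=0.995, iters=2000, seed=0):
    """Propose random perturbations; accept worse moves with probability
    exp(-delta / T), then lower the temperature T geometrically."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(iters):
        cand = x + rng.gauss(0.0, 0.5)
        fc = f(cand)
        # Always accept improvements; accept worse moves stochastically.
        if fc < fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling
    return best_x, best_f

# Double-well objective: local minimum near x = 1, global minimum near x = -1.
f = lambda x: (x * x - 1.0) ** 2 + 0.2 * x
best_x, best_f = simulated_annealing(f, x0=2.0)
```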
Particle swarm optimization (PSO)
A method inspired by the flocking behavior of birds. Each particle searches while being attracted to both its personal best and the global best of the swarm. → Advanced: Swarm Intelligence
Differential evolution (DE)
An evolutionary method that generates new candidate solutions using difference vectors between individuals. It is easy to tune and highly practical. → Advanced: Evolutionary Strategies
CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
An evolution strategy that adaptively updates the parameters (mean and covariance matrix) of a multivariate normal distribution. It is a standard method for black-box optimization. → Advanced: Evolutionary Strategies
Bayesian optimization
A method that approximates the objective function with a Gaussian process and selects the next evaluation point using an acquisition function. Effective for optimizing expensive-to-evaluate functions. → Advanced: Bayesian Optimization
Tabu search
A method that records recently visited solutions in a tabu list and forbids revisiting them while performing local search. Widely used in combinatorial optimization.
