Glossary of Mathematical Optimization

This glossary organizes key terms that frequently appear in optimization theory, grouped by topic. Each term includes a definition and links to related pages.

Fundamentals

Objective function
The function $f(\mathbf{x})$ to be minimized (or maximized). Also called a cost function or loss function.
Feasible region
The set of all points that satisfy every constraint. Also called the feasible set.
Global optimum
A point that minimizes the objective function over the entire feasible region.
Local optimum
A point that minimizes the objective function within some neighborhood. It is not necessarily a global optimum.
Gradient
The vector of partial derivatives of a multivariate function: $\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)^T$. It points in the direction of steepest ascent. → Introduction: Gradient
Hessian matrix
The symmetric matrix of second-order partial derivatives: $\nabla^2 f(x) = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]$. It captures curvature information and is used in second-order optimality conditions and Newton's method. → Intermediate: Newton's Method
Jacobian matrix
The $m \times n$ matrix of first-order partial derivatives of a vector-valued function $\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m$. The gradient is the transpose of the Jacobian of a scalar function.
Stationary point
A point satisfying $\nabla f(x^*) = 0$. It may be a local minimum, a local maximum, or a saddle point.
Saddle point
A stationary point that is a local minimum in some directions and a local maximum in others. It corresponds to the case where the Hessian is indefinite (having both positive and negative eigenvalues).
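The eigenvalue test can be checked numerically. As a hypothetical illustration (not from the source), the origin of $f(x, y) = x^2 - y^2$ is a stationary point whose constant Hessian $\mathrm{diag}(2, -2)$ is indefinite, so it is a saddle:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a stationary point at the origin;
# its Hessian is constant: diag(2, -2).
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

# Eigenvalues of mixed sign => indefinite Hessian => saddle point.
eigenvalues = np.linalg.eigvalsh(hessian)
is_saddle = eigenvalues.min() < 0 < eigenvalues.max()
```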

Convex Analysis

Convex set
A set in which the line segment between any two points lies within the set: $\theta x + (1-\theta)y \in C$ for all $x, y \in C$ and $0 \le \theta \le 1$. → Introduction: Convex Sets and Convex Functions
Convex function
A function on a convex domain satisfying $f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$ for all $x, y$ in the domain and $\theta \in [0, 1]$. Any local optimum is also a global optimum.
Strongly convex function
A function satisfying $f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\|y-x\|^2$ for some parameter $m > 0$. It guarantees the existence of a unique global minimizer and ensures convergence rate bounds for gradient descent.
Convex cone
A set that is convex and satisfies $\alpha x \in C$ for all $x \in C$ and $\alpha \ge 0$. The set of positive semidefinite matrices $\mathbb{S}^n_+$ is a representative example.
Epigraph
The region above the graph of a function $f$: $\text{epi}\,f = \{(x, t) \mid f(x) \le t\}$. $f$ is convex $\Leftrightarrow$ $\text{epi}\,f$ is a convex set. It is a fundamental tool in convex analysis.
Subgradient
For a non-smooth convex function $f$, a vector $g$ satisfying $f(y) \ge f(x) + g^T(y-x)$ ($\forall y$). At differentiable points, the subgradient coincides with the gradient. The subdifferential $\partial f(x)$ is the set of all subgradients.
Conjugate function (Fenchel conjugate)
$f^*(y) = \sup_x \{y^T x - f(x)\}$. A dual representation of a convex function that forms the basis of duality theory.
Proximal operator
$\text{prox}_f(x) = \arg\min_u \left\{f(u) + \frac{1}{2}\|u - x\|^2\right\}$. It is used in optimization problems with non-smooth regularization terms (ISTA, ADMM, etc.).
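For the scaled $\ell_1$ norm $f(u) = \lambda\|u\|_1$, the proximal operator has the well-known closed form of componentwise soft thresholding; a minimal NumPy sketch with illustrative values:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of f(u) = lam * ||u||_1: soft thresholding,
    which shrinks each component toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Components with |x_i| <= lam are set exactly to zero (sparsity).
shrunk = prox_l1(np.array([3.0, -0.5, 1.2]), lam=1.0)
```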
Lipschitz continuity
A function $f$ is said to be $L$-Lipschitz continuous if there exists a constant $L > 0$ such that $\|f(x) - f(y)\| \le L\|x - y\|$. Lipschitz continuity of the gradient is important in convergence rate analysis.

Optimality Conditions

First-order necessary condition
For an unconstrained problem, if $x^*$ is a local minimizer, then $\nabla f(x^*) = 0$.
Second-order sufficient condition
If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite), then $x^*$ is a strict local minimizer.
KKT conditions (Karush-Kuhn-Tucker conditions)
First-order necessary conditions for constrained optimization. They consist of four conditions:
  1. Stationarity: $\nabla f + \sum_i \lambda_i \nabla g_i + \sum_j \mu_j \nabla h_j = 0$
  2. Primal feasibility: $g_i(x) \le 0,\; h_j(x) = 0$
  3. Dual feasibility: $\lambda_i \ge 0$
  4. Complementary slackness: $\lambda_i g_i(x) = 0$
For convex problems, they are also sufficient. → Basic: KKT Conditions
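The four conditions can be verified numerically on a small example. For the hypothetical problem $\min x^2$ subject to $g(x) = 1 - x \le 0$, the minimizer is $x^* = 1$ with multiplier $\lambda^* = 2$:

```python
# KKT check for min x^2 s.t. g(x) = 1 - x <= 0 (illustrative problem).
x_star, lam = 1.0, 2.0

grad_f = 2.0 * x_star          # objective gradient at x*
grad_g = -1.0                  # constraint gradient (g(x) = 1 - x)
g = 1.0 - x_star               # constraint value at x*

stationarity = abs(grad_f + lam * grad_g) < 1e-12
primal_feasibility = g <= 0.0
dual_feasibility = lam >= 0.0
complementary_slackness = abs(lam * g) < 1e-12
kkt_holds = all([stationarity, primal_feasibility,
                 dual_feasibility, complementary_slackness])
```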
Constraint qualification
A regularity condition on the constraints under which the KKT conditions are necessary. Examples include the Linear Independence Constraint Qualification (LICQ) and Slater's condition.
Complementary slackness
$\lambda_i g_i(x^*) = 0$: at the optimal solution, if an inequality constraint is not active ($g_i < 0$), then its multiplier is $\lambda_i = 0$. Conversely, if $\lambda_i > 0$, the constraint holds with equality (active constraint).

Duality Theory

Lagrangian
$L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$. A tool for incorporating constraints into the objective function. → Introduction: Lagrange Method
Lagrange multiplier
Dual variables $\lambda_i, \mu_j$ corresponding to the constraints. They represent the rate of change of the optimal objective value when the constraint is slightly relaxed (shadow price).
Dual problem
$\max_{\lambda \ge 0,\, \mu} \min_x L(x, \lambda, \mu)$. It provides a lower bound on the primal problem. → Basic: Duality Theory
Weak duality
The optimal value of the dual problem $d^*$ is at most the optimal value of the primal problem $p^*$: $d^* \le p^*$. This always holds.
Strong duality
$d^* = p^*$ (the duality gap is zero). It holds for convex problems satisfying Slater's condition.
Duality gap
$\eta = p^* - d^* \ge 0$. The difference between the optimal values of the primal and dual problems. It is used for convergence checks and stopping criteria. Under strong duality, $\eta = 0$.
Slater's condition
For a convex problem, it holds when there exists a strictly feasible point, i.e. one at which every inequality constraint is satisfied strictly ($g_i(x) < 0$). It is the most widely used sufficient condition for strong duality.
Fenchel duality
A dual transformation using conjugate functions. The dual of $\min_x \{f(x) + g(Ax)\}$ is $\max_y \{-f^*(-A^Ty) - g^*(y)\}$. It provides the theoretical foundation for decomposition algorithms such as ADMM.

Algorithms

Gradient descent
$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$. The most fundamental iterative optimization method. → Intermediate: Gradient Descent
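A minimal sketch with a fixed step size, applied to an illustrative quadratic (all names are hypothetical):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, iters=200):
    """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# f(x) = 0.5 * ||x - 1||^2 has gradient x - 1 and minimizer (1, 1).
x_min = gradient_descent(lambda x: x - 1.0, x0=[5.0, -3.0])
```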
Stochastic gradient descent (SGD)
Uses gradient estimates from mini-batches. Suitable for large-scale data. → Intermediate: SGD
Newton's method
$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)$. Achieves local quadratic convergence but requires computation and inversion of the Hessian matrix. → Intermediate: Newton's Method
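On a quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ the Newton step is exact, reaching the minimizer $A^{-1}b$ in a single iteration; a sketch with illustrative data, solving a linear system rather than forming the inverse:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton update: x - [hess(x)]^{-1} grad(x),
    computed via a linear solve instead of an explicit inverse."""
    return x - np.linalg.solve(hess(x), grad(x))

# Quadratic f(x) = 0.5 x^T A x - b^T x: gradient A x - b, Hessian A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```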
Quasi-Newton method
A method that approximates the Hessian matrix from gradient information without computing it directly. BFGS and L-BFGS are representative examples. Achieves superlinear convergence.
Conjugate gradient method
A method that sequentially selects search directions conjugate to the previous ones. It solves an $n$-dimensional quadratic function in at most $n$ steps. → Intermediate: Conjugate Gradient
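For the quadratic $\frac{1}{2}x^T A x - b^T x$ with symmetric positive definite $A$, minimizing is equivalent to solving $Ax = b$; a minimal linear-CG sketch on illustrative data:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Linear CG for A x = b with A symmetric positive definite,
    i.e. the minimizer of 0.5 x^T A x - b^T x."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual = negative gradient
    d = r.copy()           # first search direction
    for _ in range(len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)  # makes the next d conjugate
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_cg = conjugate_gradient(A, b)  # terminates in at most 2 steps here
```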
Trust region method
A method that minimizes a quadratic model within a neighborhood (trust region) of the current point. It has more robust convergence properties than line search methods. → Intermediate: Trust Region
Nelder-Mead method
A derivative-free simplex-based direct search method. Effective for low-dimensional non-smooth problems. → Intermediate: Nelder-Mead
Direct search method
A general term for optimization methods that use only function values without gradients. Examples include pattern search and MADS. → Intermediate: Direct Search
Interior point method
A method that follows a path through the interior of the feasible region. It uses a barrier function and can solve linear programs in polynomial time. → Intermediate: Interior Point Method
Step size (learning rate)
The magnitude of the update $\alpha_k$ in an iterative method. Too large causes divergence; too small causes slow convergence. Appropriate values are chosen via the Armijo condition or Wolfe conditions.
Line search
After the search direction $d_k$ is determined, a procedure to approximately find $\alpha_k = \arg\min_\alpha f(x_k + \alpha d_k)$.
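A common inexact line search is backtracking under the Armijo condition: start from a trial step and shrink it until sufficient decrease holds. A minimal sketch on an illustrative quadratic:

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo condition holds:
    f(x + alpha d) <= f(x) + c * alpha * grad_f(x)^T d."""
    alpha = alpha0
    fx, slope = f(x), grad_f(x) @ d
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

# f(x) = ||x||^2 with the steepest-descent direction at x = (2, 0).
f = lambda x: x @ x
grad = lambda x: 2.0 * x
x = np.array([2.0, 0.0])
step = backtracking(f, grad, x, -grad(x))
```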
Convergence rate
The speed at which an iterative method approaches the optimal solution. Classified as linear ($\|x_{k+1} - x^*\| \le c\|x_k - x^*\|$), superlinear, or quadratic ($\|x_{k+1} - x^*\| \le c\|x_k - x^*\|^2$).

Constrained Optimization

Active constraint
An inequality constraint that holds with equality at the optimal solution $x^*$: $g_i(x^*) = 0$. The collection of active constraints is called the active set.
Barrier function
A function that goes to infinity at the boundary of the feasible region. The logarithmic barrier $-\sum_i \log(-g_i(x))$ is the most common example. It is used in interior point methods.
Penalty method
A method that imposes a penalty $\rho\sum_i \max(0, g_i(x))^2$ for constraint violations. As $\rho \to \infty$, it converges to the solution of the constrained problem.
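A worked sketch on the hypothetical problem $\min x^2$ subject to $1 - x \le 0$ (constrained minimizer $x^* = 1$): setting the derivative of the penalized objective to zero gives $x(\rho) = \rho/(1+\rho)$, which approaches $x^*$ as $\rho$ grows. Gradient descent on the penalized objective recovers this:

```python
def penalized_grad(x, rho):
    """Gradient of the penalized objective x^2 + rho * max(0, 1 - x)^2."""
    return 2.0 * x - 2.0 * rho * max(0.0, 1.0 - x)

def solve_penalized(rho, x=0.0, iters=2000):
    """Gradient descent on the penalized objective; the step size is
    scaled by 1/(1 + rho) because the penalty stiffens the problem."""
    alpha = 0.4 / (1.0 + rho)
    for _ in range(iters):
        x -= alpha * penalized_grad(x, rho)
    return x

x_rho = solve_penalized(1000.0)  # close to 1000/1001, i.e. nearly feasible
```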
Augmented Lagrangian method
A method that adds a quadratic penalty term to the Lagrangian; for equality constraints, $L_\rho(x,\mu) = f(x) + \sum_j \mu_j h_j(x) + \frac{\rho}{2}\sum_j h_j(x)^2$. Numerically more stable than the penalty method.
ADMM (Alternating Direction Method of Multipliers)
A variant of the augmented Lagrangian method that splits the problem into separable subproblems and solves them alternately. Widely used in large-scale distributed optimization.
Simplex method
An algorithm that solves linear programming problems by traversing the vertices of a polyhedron. Although its worst-case complexity is exponential, it is very fast in practice. → Basic: Simplex Method
Slack variable
A nonnegative variable that converts an inequality constraint $g_i(x) \le 0$ into an equality constraint $g_i(x) + s_i = 0,\; s_i \ge 0$. → Basic: Slack Variables

Special Problem Classes

Linear programming (LP)
An optimization problem where the objective function and all constraints are linear. → Basic: Linear Programming
Quadratic programming (QP)
A problem with a quadratic objective function and linear constraints. → Basic: Quadratic Programming
Semidefinite programming (SDP)
A convex optimization problem where the variable is a positive semidefinite matrix $X \succeq 0$. Used in combinatorial optimization relaxations, control theory, and quantum information.
Second-order cone programming (SOCP)
A convex problem with constraints of the form $\|Ax + b\| \le c^Tx + d$. It lies between LP and SDP. Used in robust optimization and portfolio optimization.
DC programming
A problem whose objective function is expressed as the difference of two convex functions: $f(x) = g(x) - h(x)$. The DCA (DC Algorithm) finds local solutions. This class covers a broad range of nonconvex problems.
Integer programming (IP)
A problem where some or all variables are constrained to be integers. Generally NP-hard. → Advanced: Integer Programming
Combinatorial optimization
A problem of finding an optimal solution from a discrete set of candidates. The traveling salesman problem (TSP) and the knapsack problem are representative examples. → Advanced: Combinatorial Optimization
Multi-objective optimization
A problem of simultaneously optimizing multiple objective functions. The goal is to find the set of Pareto optimal solutions. → Advanced: Multi-Objective Optimization
Robust optimization
Optimization that guarantees worst-case performance against parameter uncertainty. → Advanced: Robust Optimization
Stochastic optimization
An optimization problem where the objective function or constraints involve random variables. Includes expected value optimization and chance constraints. → Advanced: Optimization Under Uncertainty

Metaheuristics

Genetic algorithm (GA)
An optimization method inspired by biological evolution. It improves solutions through repeated selection, crossover, and mutation. → Advanced: Genetic Algorithm
Simulated annealing (SA)
A method inspired by the annealing process of metals. It stochastically explores solutions while lowering the temperature, enabling escape from local optima. → Advanced: Simulated Annealing
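A minimal sketch of the accept/reject rule (Metropolis criterion) with geometric cooling, applied to a hypothetical double-well function; all parameter values are illustrative:

```python
import math, random

def simulated_annealing(f, x0, t0=1.0, cooling=0.995, iters=2000, seed=0):
    """Propose random perturbations; accept worse moves with probability
    exp(-delta / T), then lower the temperature T geometrically."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(iters):
        cand = x + rng.gauss(0.0, 0.5)
        fc = f(cand)
        # Always accept improvements; accept worse moves stochastically.
        if fc < fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling
    return best_x, best_f

# Double-well objective: local minimum near x = 1, global minimum near x = -1.
f = lambda x: (x * x - 1.0) ** 2 + 0.2 * x
best_x, best_f = simulated_annealing(f, x0=2.0)
```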
Particle swarm optimization (PSO)
A method inspired by the flocking behavior of birds. Each particle searches while being attracted to both its personal best and the global best of the swarm. → Advanced: Swarm Intelligence
Differential evolution (DE)
An evolutionary method that generates new candidate solutions using difference vectors between individuals. It is easy to tune and highly practical. → Advanced: Evolutionary Strategies
CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
An evolution strategy that adaptively updates the parameters (mean and covariance matrix) of a multivariate normal distribution. It is a standard method for black-box optimization. → Advanced: Evolutionary Strategies
Bayesian optimization
A method that approximates the objective function with a Gaussian process and selects the next evaluation point using an acquisition function. Effective for optimizing expensive-to-evaluate functions. → Advanced: Bayesian Optimization
Tabu search
A method that records recently visited solutions in a tabu list and forbids revisiting them while performing local search. Widely used in combinatorial optimization.
