Mathematical Optimization
What Is Optimization?
Optimization (also known as mathematical programming) is the discipline of finding the values of variables that minimize (or maximize) an objective function subject to given constraints. It is applied in a wide range of fields, including engineering, economics, and machine learning.
Figure 1: General Form of an Optimization Problem
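In symbols, the general form referenced by the figure can be written as follows (using the same $g_i \le 0$, $h_j = 0$ constraint convention as the Lagrangian later on this page):

$$\begin{aligned} \min_{x \in \mathbb{R}^n} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, p \\ & h_j(x) = 0, \quad j = 1, \ldots, q \end{aligned}$$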
Classification of Optimization Problems
- Linear Programming: Both the objective function and constraints are linear
- Convex Optimization: The objective function is convex and the feasible region is a convex set
- Nonlinear Programming: The objective function or constraints are nonlinear
- Integer Programming: Some or all variables are restricted to integers
- Stochastic Optimization: Optimization under uncertainty
Figure 2: Convex vs. Nonconvex Functions
Where Is It Used?
- Machine Learning: Loss function minimization, neural network training
- Operations Research: Production planning, transportation problems, scheduling
- Financial Engineering: Portfolio optimization, risk management
- Control Engineering: Optimal control, model predictive control
- Signal Processing: Filter design, compressed sensing
This series systematically covers optimization theory from fundamentals to advanced algorithms in four stages.
Intuitive Examples
Example 1: Minimizing a Quadratic Function
$$f(x) = x^2 - 4x + 5 = (x - 2)^2 + 1$$
Completing the square immediately shows that the minimum value $f(2) = 1$ is attained at $x = 2$. This is the simplest unconstrained optimization problem; using calculus, $f'(x) = 2x - 4 = 0$ yields the same result.
Example 2: The Hill-Climbing Intuition
Imagine a hiker trying to reach the summit in thick fog. They choose the steepest uphill direction within sight and walk that way -- this is the intuition behind gradient descent (with the sign flipped for minimization). However, the nearest hilltop may not be the highest peak (the local optimum problem).
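This intuition can be sketched in a few lines of Python on the quadratic from Example 1. The step size `alpha = 0.1` and starting point are arbitrary choices for illustration:

```python
def grad_descent(df, x0, alpha=0.1, steps=100):
    """Minimize by repeatedly stepping against the gradient (1-D sketch)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * df(x)   # x_{k+1} = x_k - alpha * f'(x_k)
    return x

# f(x) = x^2 - 4x + 5 has derivative f'(x) = 2x - 4 and minimizer x* = 2.
x_star = grad_descent(lambda x: 2 * x - 4, x0=0.0)
# x_star is approximately 2.0
```

Because this $f$ is convex, the nearest "hilltop" (here, valley) is also the global one; on a nonconvex function the same loop can stall at a local optimum.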
Example 3: The Geometry of Constrained Optimization
Consider a consumer maximizing utility $u(x_1, x_2)$ subject to a budget constraint $p_1 x_1 + p_2 x_2 \le B$. The optimal solution lies at the tangent point between the contour lines (indifference curves) and the constraint region. At this point, the gradient of the objective and the normal to the constraint are parallel -- this is the geometric meaning of the method of Lagrange multipliers.
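As a concrete sketch, take the hypothetical utility $u(x_1, x_2) = x_1 x_2$ with prices and budget chosen for the example. Setting $\nabla u = \lambda \nabla (p_1 x_1 + p_2 x_2)$ gives $x_2 = \lambda p_1$ and $x_1 = \lambda p_2$, hence $p_1 x_1 = p_2 x_2 = B/2$ at the optimum; the parallel-gradient condition can then be checked numerically:

```python
# Hypothetical example: u(x1, x2) = x1 * x2 subject to p1*x1 + p2*x2 = B.
p1, p2, B = 2.0, 1.0, 8.0

x1 = B / (2 * p1)   # = 2.0, from p1*x1 = B/2
x2 = B / (2 * p2)   # = 4.0, from p2*x2 = B/2
lam = x1 / p2       # multiplier: marginal utility of an extra unit of budget

# Tangency: grad u must be parallel to the budget line's normal (p1, p2).
grad_u = (x2, x1)           # (du/dx1, du/dx2) at the optimum
grad_g = (p1, p2)           # normal of the budget constraint
cross = grad_u[0] * grad_g[1] - grad_u[1] * grad_g[0]
parallel = abs(cross) < 1e-12   # zero 2-D cross product <=> parallel
```

The zero cross product is exactly the geometric statement that the indifference curve touches the budget line rather than crossing it.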
Content by Level
Introduction
Foundations of Optimization
- Formulating optimization problems
- Single-variable optimization
- Multivariable optimization
- Gradients and optimality conditions
- Method of Lagrange multipliers
- Convex sets and convex functions
Beginner
Linear Programming and Convex Optimization
- Linear programming problems
- Simplex method
- Duality theory
- Theory of convex optimization
- KKT conditions
- Quadratic programming
Intermediate
Numerical Optimization Algorithms
- Gradient descent
- Newton's method and quasi-Newton methods
- Conjugate gradient method
- Interior-point methods
- Stochastic gradient descent
- Constrained optimization algorithms
Advanced
Advanced Topics
- Integer programming
- Combinatorial optimization
- Large-scale and distributed optimization
- Robust optimization
- Calculus of variations and optimal control
- Semidefinite programming (SDP)
- Metaheuristics
- Optimization under uncertainty
Comparison of Major Algorithms
| Algorithm | Convergence Rate | Cost per Iteration | Memory | Applicability |
|---|---|---|---|---|
| Gradient Descent | Linear | $O(n)$ | $O(n)$ | Large-scale, smooth objectives |
| Stochastic Gradient Descent (SGD) | Sublinear | $O(n)$ | $O(n)$ | Large-scale data, machine learning |
| Conjugate Gradient | Superlinear | $O(n)$ | $O(n)$ | Large-scale quadratics, symmetric positive-definite systems |
| Newton's Method | Quadratic | $O(n^3)$ | $O(n^2)$ | Small to medium scale, twice differentiable |
| Quasi-Newton (BFGS) | Superlinear | $O(n^2)$ | $O(n^2)$ | Medium scale, first derivatives only |
| L-BFGS | Superlinear | $O(mn)$ | $O(mn)$ | Large-scale, memory-constrained |
| Trust Region Method | Quadratic | $O(n^3)$ | $O(n^2)$ | Nonconvex problems, robust convergence |
| Nelder-Mead Method | Sublinear | $O(n)$ | $O(n^2)$ | Non-differentiable, low-dimensional |
| Interior-Point Method | Polynomial | $O(n^3)$ | $O(n^2)$ | Linear and convex quadratic programs |
| Simplex Method | (Worst-case exponential) | $O(mn)$ | $O(mn)$ | Linear programs, fast in practice |
| Genetic Algorithm | Stochastic | $O(Nn)$ | $O(Nn)$ | Black-box and combinatorial optimization |
| Simulated Annealing | Stochastic | $O(n)$ | $O(n)$ | Discrete and continuous nonconvex problems |
$n$: number of variables, $m$: number of constraints or memory size, $N$: population size. Convergence rates are typical values for sufficiently smooth convex problems.
Key Concepts and Formulas
Definition of the Gradient
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$$
First-Order Optimality Condition
$$\nabla f(x^*) = 0$$
Gradient Descent
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
Lagrangian
$$L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$$
KKT Conditions
$$\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$$
$$\lambda_i^* g_i(x^*) = 0, \quad \lambda_i^* \geq 0$$
Definition of a Convex Function
$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta) f(y)$$
($0 \leq \theta \leq 1$)
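The defining inequality can be checked on random samples; a quick sketch (sampling cannot prove convexity, but one violation disproves it):

```python
import random

def is_convex_on_samples(f, trials=1000, lo=-10.0, hi=10.0, tol=1e-9):
    """Sample f(tx + (1-t)y) <= t f(x) + (1-t) f(y) at random points.
    Passing every sample is only evidence; a single failure is a disproof."""
    random.seed(0)
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False
    return True

assert is_convex_on_samples(lambda x: x * x)        # convex: inequality holds
assert not is_convex_on_samples(lambda x: -x * x)   # concave: chord lies below
```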
Hessian Matrix (Second-Order Condition)
$$\nabla^2 f(x^*) = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right] \succeq 0$$
Necessary condition for a stationary point $x^*$ to be a local minimum (strict positive definiteness, $\succ 0$, together with $\nabla f(x^*) = 0$ is sufficient)
Strong Convexity
$$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|^2$$
($m > 0$: strong convexity parameter; guarantees a unique global minimum)
Duality Gap
$$\eta = f(x) - g(\lambda, \mu) \geq 0$$
($f$: primal value, $g$: dual value; $\eta = 0$ indicates strong duality)
Newton's Method Update
$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
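For a quadratic objective the update is exact: a single Newton step from any starting point lands on the minimizer. A 1-D sketch on $f(x) = x^2 - 4x + 5$ from Example 1, where the Hessian is the constant second derivative $2$:

```python
def newton_step(df, d2f, x):
    """One Newton update, the 1-D case: x - f'(x) / f''(x)."""
    return x - df(x) / d2f(x)

# f(x) = x^2 - 4x + 5:  f'(x) = 2x - 4,  f''(x) = 2.
x0 = 10.0
x1 = newton_step(lambda x: 2 * x - 4, lambda x: 2.0, x0)
# One step reaches the exact minimizer x* = 2 because f is quadratic.
```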
Dual Problem
$$\max_{\lambda \geq 0, \mu} \min_{x} L(x, \lambda, \mu)$$
Weak duality always holds; strong duality ($\eta = 0$) holds for convex problems that satisfy a constraint qualification such as Slater's condition
Glossary
A glossary of frequently used terms in optimization theory is provided. It covers more than 50 terms organized by topic, including epigraph, subgradient, barrier function, duality gap, and complementary slackness.
Prerequisites
- Introduction: Calculus (single and multivariable), basics of linear algebra
- Beginner: Introduction-level content, matrix theory
- Intermediate: Beginner-level content, basics of numerical analysis
- Advanced: Intermediate-level content, functional analysis, combinatorics