Optimization

Mathematical Optimization

What Is Optimization?

Optimization (mathematical optimization) is the mathematical discipline of finding the values of variables that minimize (or maximize) an objective function subject to given constraints. It is applied in a wide range of fields, including engineering, economics, and machine learning.

Figure 1: General Form of an Optimization Problem

$$\begin{array}{ll} \textbf{minimize} & f(\mathbf{x}) \\[6pt] \textbf{subject to} & g_i(\mathbf{x}) \le 0, \quad i = 1, \dots, m \\ & h_j(\mathbf{x}) = 0, \quad j = 1, \dots, p \\ & \mathbf{x} \in X \end{array}$$
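
As a concrete, illustrative instance of this general form, the sketch below minimizes $f(\mathbf{x}) = (x_1 - 1)^2 + (x_2 - 2)^2$ subject to one inequality and one equality constraint using SciPy's `scipy.optimize.minimize`; the specific problem and the choice of the SLSQP solver are assumptions made here for demonstration only. Note that SciPy expects inequality constraints in the form $c(\mathbf{x}) \ge 0$, so a constraint $g(\mathbf{x}) \le 0$ is passed as $-g(\mathbf{x}) \ge 0$.

```python
import numpy as np
from scipy.optimize import minimize

# Objective: f(x) = (x1 - 1)^2 + (x2 - 2)^2
def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

# Inequality constraint g(x) = x1 + x2 - 2 <= 0, passed to SciPy as -g(x) >= 0
ineq = {"type": "ineq", "fun": lambda x: 2.0 - x[0] - x[1]}

# Equality constraint h(x) = x1 - x2 = 0
eq = {"type": "eq", "fun": lambda x: x[0] - x[1]}

result = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=[ineq, eq])
print(result.x, result.fun)  # roughly [1.0, 1.0] and 1.0
```

Here the unconstrained minimizer $(1, 2)$ is infeasible, so the solution sits on the boundary of the feasible region at $(1, 1)$.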

Classification of Optimization Problems

  • Linear Programming: Both the objective function and constraints are linear
  • Convex Optimization: The objective function is convex and the feasible region is a convex set
  • Nonlinear Programming: The objective function or constraints are nonlinear
  • Integer Programming: Some or all variables are restricted to integers
  • Stochastic Optimization: Optimization under uncertainty

Figure 2: Convex vs. Nonconvex Functions

(Left: a convex function, where any local minimum is also the global minimum. Right: a nonconvex function with multiple local minima.)

Where Is It Used?

  • Machine Learning: Loss function minimization, neural network training
  • Operations Research: Production planning, transportation problems, scheduling
  • Financial Engineering: Portfolio optimization, risk management
  • Control Engineering: Optimal control, model predictive control
  • Signal Processing: Filter design, compressed sensing

This series systematically covers optimization theory from fundamentals to advanced algorithms in four stages.

Intuitive Examples

Example 1: Minimizing a Quadratic Function

$$f(x) = x^2 - 4x + 5 = (x - 2)^2 + 1$$

Completing the square immediately shows that the minimum value $f(2) = 1$ is attained at $x = 2$. This is the simplest unconstrained optimization problem; using calculus, $f'(x) = 2x - 4 = 0$ yields the same result.
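
The same answer can be verified numerically; below is a minimal sketch using SciPy's `minimize_scalar` (an optional tool, not required by the argument above).

```python
from scipy.optimize import minimize_scalar

# f(x) = x^2 - 4x + 5 = (x - 2)^2 + 1
result = minimize_scalar(lambda x: x**2 - 4*x + 5)
print(result.x, result.fun)  # approximately 2.0 and 1.0
```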

Example 2: The Hill-Climbing Intuition

Imagine a hiker trying to reach the summit in thick fog. They choose the steepest uphill direction within sight and walk that way -- this is the intuition behind gradient descent (with the sign flipped for minimization). However, the nearest hilltop may not be the highest peak (the local optimum problem).
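
A minimal sketch of this idea in code, assuming an illustrative nonconvex function $f(x) = x^4 - 3x^2 + x$ and a fixed step size (neither is fixed by the text): starting from different points, the same downhill rule lands in different local minima.

```python
def gradient_descent(df, x0, alpha=0.01, steps=2000):
    """Repeatedly step in the direction of the negative gradient."""
    x = x0
    for _ in range(steps):
        x -= alpha * df(x)
    return x

# f(x) = x^4 - 3x^2 + x is nonconvex; its derivative is f'(x) = 4x^3 - 6x + 1
df = lambda x: 4 * x**3 - 6 * x + 1

print(gradient_descent(df, x0=-2.0))  # ~ -1.30, the lower (global) minimum
print(gradient_descent(df, x0=+2.0))  # ~ +1.13, a higher local minimum
```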

Example 3: The Geometry of Constrained Optimization

Consider a consumer maximizing utility $u(x_1, x_2)$ subject to a budget constraint $p_1 x_1 + p_2 x_2 \le B$. The optimal solution lies at the tangent point between the contour lines (indifference curves) and the constraint region. At this point, the gradient of the objective and the normal to the constraint are parallel -- this is the geometric meaning of the method of Lagrange multipliers.
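
For a concrete case (the utility function below is an illustrative choice, not fixed by the text), take $u(x_1, x_2) = x_1 x_2$ and assume the budget is spent in full. Tangency means the slope of the indifference curve equals the slope of the budget line, i.e. $\nabla u$ is parallel to the budget normal $(p_1, p_2)$:

$$\frac{\partial u/\partial x_1}{\partial u/\partial x_2} = \frac{x_2}{x_1} = \frac{p_1}{p_2}, \qquad p_1 x_1 + p_2 x_2 = B \quad\Longrightarrow\quad x_1^* = \frac{B}{2p_1}, \quad x_2^* = \frac{B}{2p_2}$$

Half the budget goes to each good, and the associated Lagrange multiplier $\lambda = B/(2 p_1 p_2)$ measures the marginal utility of an extra unit of budget.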

Comparison of Major Algorithms

| Algorithm | Convergence Rate | Cost per Iteration | Memory | Applicability |
| --- | --- | --- | --- | --- |
| Gradient Descent | Linear | $O(n)$ | $O(n)$ | Large-scale, smooth objectives |
| Stochastic Gradient Descent (SGD) | Sublinear | $O(n)$ | $O(n)$ | Large-scale data, machine learning |
| Conjugate Gradient | Superlinear | $O(n)$ | $O(n)$ | Large-scale quadratics, symmetric positive-definite systems |
| Newton's Method | Quadratic | $O(n^3)$ | $O(n^2)$ | Small to medium scale, twice differentiable |
| Quasi-Newton (BFGS) | Superlinear | $O(n^2)$ | $O(n^2)$ | Medium scale, first derivatives only |
| L-BFGS | Superlinear | $O(mn)$ | $O(mn)$ | Large-scale, memory-constrained |
| Trust Region Method | Quadratic | $O(n^3)$ | $O(n^2)$ | Nonconvex problems, robust convergence |
| Nelder-Mead Method | Sublinear | $O(n)$ | $O(n^2)$ | Non-differentiable, low-dimensional |
| Interior-Point Method | Polynomial | $O(n^3)$ | $O(n^2)$ | Linear and convex quadratic programs |
| Simplex Method | Exponential (worst case) | $O(mn)$ | $O(mn)$ | Linear programs, fast in practice |
| Genetic Algorithm | Stochastic | $O(Nn)$ | $O(Nn)$ | Black-box and combinatorial optimization |
| Simulated Annealing | Stochastic | $O(n)$ | $O(n)$ | Discrete and continuous nonconvex problems |

$n$: number of variables; $m$: number of constraints (Simplex Method) or the L-BFGS memory parameter; $N$: population size. Convergence rates are typical values for sufficiently smooth convex problems.

Key Concepts and Formulas

Definition of the Gradient

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$$

First-Order Optimality Condition

$$\nabla f(x^*) = 0$$

Gradient Descent

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
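
A minimal sketch of this update rule, applied to a convex quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ whose gradient is $Ax - b$ (the matrix, step size, and tolerance below are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """x_{k+1} = x_k - alpha * grad(x_k), stopped once the gradient is tiny."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# Minimizing 0.5 x^T A x - b^T x amounts to solving A x = b for SPD A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(gradient_descent(lambda x: A @ x - b, np.zeros(2)))  # ~ [0.2, 0.4]
print(np.linalg.solve(A, b))                               # exact solution for comparison
```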

Lagrangian

$$L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$$

KKT Conditions

$$\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$$

$$\lambda_i^* g_i(x^*) = 0, \quad \lambda_i^* \geq 0, \quad g_i(x^*) \leq 0, \quad h_j(x^*) = 0$$
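
As a quick check on a one-variable toy problem (chosen here only for illustration), minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$. The conditions become

$$2x^* - \lambda^* = 0, \qquad \lambda^*(1 - x^*) = 0, \qquad \lambda^* \ge 0$$

Taking $\lambda^* = 0$ would force $x^* = 0$, which is infeasible; so the constraint is active, giving $x^* = 1$, $\lambda^* = 2$, and the optimal value $f(x^*) = 1$.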

Definition of a Convex Function

$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta) f(y)$$

($0 \leq \theta \leq 1$)

Hessian Matrix (Second-Order Condition)

$$\nabla^2 f(x^*) = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right] \succeq 0$$

Necessary condition for a stationary point $x^*$ to be a local minimum

Strong Convexity

$$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|^2$$

($m > 0$: strong convexity parameter; guarantees a unique global minimum)

Duality Gap

$$\eta = f(x) - g(\lambda, \mu) \geq 0$$

($f$: primal value, $g$: dual value; $\eta = 0$ indicates strong duality)

Newton's Method Update

$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
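
A minimal sketch of this update (the Rosenbrock test function and starting point are illustrative assumptions; in practice the step is usually safeguarded with a line search or trust region):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k), computed via a linear solve."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)
    return x

# Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1)
def grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 - 400 * (y - 3 * x**2), -400 * x],
                     [-400 * x, 200.0]])

print(newton(grad, hess, np.array([-1.2, 1.0])))  # converges to roughly [1, 1]
```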

Dual Problem

$$\max_{\lambda \geq 0, \mu} \min_{x} L(x, \lambda, \mu)$$

Weak duality always holds; strong duality ($\eta = 0$) holds for convex problems that satisfy a constraint qualification such as Slater's condition
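
On the same toy problem used for the KKT conditions above (minimize $x^2$ subject to $1 - x \le 0$), the dual function and its maximum are

$$g(\lambda) = \min_x \left\{ x^2 + \lambda(1 - x) \right\} = \lambda - \frac{\lambda^2}{4}, \qquad \max_{\lambda \ge 0} g(\lambda) = g(2) = 1$$

This equals the primal optimum $f(x^*) = 1$, so the duality gap $\eta$ is zero, as strong duality predicts for this convex problem.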

Prerequisites

  • Introduction: Calculus (single and multivariable), basics of linear algebra
  • Beginner: Introduction-level content, matrix theory
  • Intermediate: Beginner-level content, basics of numerical analysis
  • Advanced: Intermediate-level content, functional analysis, combinatorics