Optimization

Mathematical Optimization

What Is Optimization?

Optimization (mathematical optimization) is the mathematical discipline of finding the values of variables that minimize (or maximize) an objective function subject to given constraints. It is applied in a wide range of fields, including engineering, economics, and machine learning.

Figure 1: General Form of an Optimization Problem

$$\begin{array}{ll} \textbf{minimize} & f(\mathbf{x}) \\[6pt] \textbf{subject to} & g_i(\mathbf{x}) \le 0, \quad i = 1, \dots, m \\ & h_j(\mathbf{x}) = 0, \quad j = 1, \dots, p \\ & \mathbf{x} \in X \end{array}$$
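
As a concrete, illustrative instance of this general form, the sketch below minimizes $f(\mathbf{x}) = (x_1 - 1)^2 + (x_2 - 2)^2$ subject to one inequality and one equality constraint using SciPy's `scipy.optimize.minimize`; the specific problem and the choice of the SLSQP solver are assumptions made here for demonstration only. Note that SciPy expects inequality constraints in the form $c(\mathbf{x}) \ge 0$, so a constraint $g(\mathbf{x}) \le 0$ is passed as $-g(\mathbf{x}) \ge 0$.

```python
import numpy as np
from scipy.optimize import minimize

# Objective: f(x) = (x1 - 1)^2 + (x2 - 2)^2
def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

# Inequality constraint g(x) = x1 + x2 - 2 <= 0, passed to SciPy as -g(x) >= 0
ineq = {"type": "ineq", "fun": lambda x: 2.0 - x[0] - x[1]}

# Equality constraint h(x) = x1 - x2 = 0
eq = {"type": "eq", "fun": lambda x: x[0] - x[1]}

result = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=[ineq, eq])
print(result.x, result.fun)  # roughly [1.0, 1.0] and 1.0
```

Here the unconstrained minimizer $(1, 2)$ is infeasible, so the solution sits on the boundary of the feasible region at $(1, 1)$.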

Classification of Optimization Problems

  • Linear Programming: Both the objective function and constraints are linear
  • Convex Optimization: The objective function is convex and the feasible region is a convex set
  • Nonlinear Programming: The objective function or constraints are nonlinear
  • Integer Programming: Some or all variables are restricted to integers
  • Stochastic Optimization: Optimization under uncertainty

Figure 2: Convex vs. Nonconvex Functions

(Left: a convex function, where any local minimum is also the global minimum. Right: a nonconvex function with multiple local minima.)

Where Is It Used?

  • Machine Learning: Loss function minimization, neural network training
  • Operations Research: Production planning, transportation problems, scheduling
  • Financial Engineering: Portfolio optimization, risk management
  • Control Engineering: Optimal control, model predictive control
  • Signal Processing: Filter design, compressed sensing

This series systematically covers optimization theory from fundamentals to advanced algorithms in four stages.

Intuitive Examples

Example 1: Minimizing a Quadratic Function

$$f(x) = x^2 - 4x + 5 = (x - 2)^2 + 1$$

Completing the square immediately shows that the minimum value $f(2) = 1$ is attained at $x = 2$. This is the simplest unconstrained optimization problem; using calculus, $f'(x) = 2x - 4 = 0$ yields the same result.
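
The same answer can be verified numerically; below is a minimal sketch using SciPy's `minimize_scalar` (an optional tool, not required by the argument above).

```python
from scipy.optimize import minimize_scalar

# f(x) = x^2 - 4x + 5 = (x - 2)^2 + 1
result = minimize_scalar(lambda x: x**2 - 4*x + 5)
print(result.x, result.fun)  # approximately 2.0 and 1.0
```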

Example 2: The Hill-Climbing Intuition

Imagine a hiker trying to reach the summit in thick fog. They choose the steepest uphill direction within sight and walk that way -- this is the intuition behind gradient descent (with the sign flipped for minimization). However, the nearest hilltop may not be the highest peak (the local optimum problem).
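
A minimal sketch of this idea in code, assuming an illustrative nonconvex function $f(x) = x^4 - 3x^2 + x$ and a fixed step size (neither is fixed by the text): starting from different points, the same downhill rule lands in different local minima.

```python
def gradient_descent(df, x0, alpha=0.01, steps=2000):
    """Repeatedly step in the direction of the negative gradient."""
    x = x0
    for _ in range(steps):
        x -= alpha * df(x)
    return x

# f(x) = x^4 - 3x^2 + x is nonconvex; its derivative is f'(x) = 4x^3 - 6x + 1
df = lambda x: 4 * x**3 - 6 * x + 1

print(gradient_descent(df, x0=-2.0))  # ~ -1.30, the lower (global) minimum
print(gradient_descent(df, x0=+2.0))  # ~ +1.13, a higher local minimum
```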

Example 3: The Geometry of Constrained Optimization

Consider a consumer maximizing utility $u(x_1, x_2)$ subject to a budget constraint $p_1 x_1 + p_2 x_2 \le B$. The optimal solution lies at the tangent point between the contour lines (indifference curves) and the constraint region. At this point, the gradient of the objective and the normal to the constraint are parallel -- this is the geometric meaning of the method of Lagrange multipliers.
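
For a concrete case (the utility function below is an illustrative choice, not fixed by the text), take $u(x_1, x_2) = x_1 x_2$ and assume the budget is spent in full. Tangency means the slope of the indifference curve equals the slope of the budget line, i.e. $\nabla u$ is parallel to the budget normal $(p_1, p_2)$:

$$\frac{\partial u/\partial x_1}{\partial u/\partial x_2} = \frac{x_2}{x_1} = \frac{p_1}{p_2}, \qquad p_1 x_1 + p_2 x_2 = B \quad\Longrightarrow\quad x_1^* = \frac{B}{2p_1}, \quad x_2^* = \frac{B}{2p_2}$$

Half the budget goes to each good, and the associated Lagrange multiplier $\lambda = B/(2 p_1 p_2)$ measures the marginal utility of an extra unit of budget.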

Comparison of Major Algorithms

| Algorithm | Convergence Rate | Cost per Iteration | Memory | Applicability |
| --- | --- | --- | --- | --- |
| Gradient Descent | Linear | $O(n)$ | $O(n)$ | Large-scale, smooth objectives |
| Stochastic Gradient Descent (SGD) | Sublinear | $O(n)$ | $O(n)$ | Large-scale data, machine learning |
| Conjugate Gradient | Superlinear | $O(n)$ | $O(n)$ | Large-scale quadratics, symmetric positive-definite systems |
| Newton's Method | Quadratic | $O(n^3)$ | $O(n^2)$ | Small to medium scale, twice differentiable |
| Quasi-Newton (BFGS) | Superlinear | $O(n^2)$ | $O(n^2)$ | Medium scale, first derivatives only |
| L-BFGS | Superlinear | $O(mn)$ | $O(mn)$ | Large-scale, memory-constrained |
| Trust Region Method | Quadratic | $O(n^3)$ | $O(n^2)$ | Nonconvex problems, robust convergence |
| Nelder-Mead Method | Sublinear | $O(n)$ | $O(n^2)$ | Non-differentiable, low-dimensional |
| Interior-Point Method | Polynomial | $O(n^3)$ | $O(n^2)$ | Linear and convex quadratic programs |
| Simplex Method | Exponential (worst case) | $O(mn)$ | $O(mn)$ | Linear programs, fast in practice |
| Genetic Algorithm | Stochastic | $O(Nn)$ | $O(Nn)$ | Black-box and combinatorial optimization |
| Simulated Annealing | Stochastic | $O(n)$ | $O(n)$ | Discrete and continuous nonconvex problems |

$n$: number of variables; $m$: number of constraints (Simplex Method) or the L-BFGS memory parameter; $N$: population size. Convergence rates are typical values for sufficiently smooth convex problems.

Key Concepts and Formulas

Definition of the Gradient

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$$

First-Order Optimality Condition

$$\nabla f(x^*) = 0$$

Gradient Descent

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$
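
A minimal sketch of this update rule, applied to a convex quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ whose gradient is $Ax - b$ (the matrix, step size, and tolerance below are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """x_{k+1} = x_k - alpha * grad(x_k), stopped once the gradient is tiny."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# Minimizing 0.5 x^T A x - b^T x amounts to solving A x = b for SPD A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(gradient_descent(lambda x: A @ x - b, np.zeros(2)))  # ~ [0.2, 0.4]
print(np.linalg.solve(A, b))                               # exact solution for comparison
```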

Lagrangian

$$L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$$

KKT Conditions

$$\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$$

$$\lambda_i^* g_i(x^*) = 0, \quad \lambda_i^* \geq 0, \quad g_i(x^*) \leq 0, \quad h_j(x^*) = 0$$
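
As a quick check on a one-variable toy problem (chosen here only for illustration), minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$. The conditions become

$$2x^* - \lambda^* = 0, \qquad \lambda^*(1 - x^*) = 0, \qquad \lambda^* \ge 0$$

Taking $\lambda^* = 0$ would force $x^* = 0$, which is infeasible; so the constraint is active, giving $x^* = 1$, $\lambda^* = 2$, and the optimal value $f(x^*) = 1$.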

Definition of a Convex Function

$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta) f(y)$$

($0 \leq \theta \leq 1$)

Hessian Matrix (Second-Order Condition)

$$\nabla^2 f(x^*) = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right] \succeq 0$$

Necessary condition for a stationary point $x^*$ to be a local minimum

Strong Convexity

$$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|^2$$

($m > 0$: strong convexity parameter; guarantees a unique global minimum)

Duality Gap

$$\eta = f(x) - g(\lambda, \mu) \geq 0$$

($f$: primal value, $g$: dual value; $\eta = 0$ indicates strong duality)

Newton's Method Update

$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
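
A minimal sketch of this update (the Rosenbrock test function and starting point are illustrative assumptions; in practice the step is usually safeguarded with a line search or trust region):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k), computed via a linear solve."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)
    return x

# Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1)
def grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 - 400 * (y - 3 * x**2), -400 * x],
                     [-400 * x, 200.0]])

print(newton(grad, hess, np.array([-1.2, 1.0])))  # converges to roughly [1, 1]
```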

Dual Problem

$$\max_{\lambda \geq 0, \mu} \min_{x} L(x, \lambda, \mu)$$

Weak duality always holds; strong duality ($\eta = 0$) holds for convex problems that satisfy a constraint qualification such as Slater's condition
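
On the same toy problem used for the KKT conditions above (minimize $x^2$ subject to $1 - x \le 0$), the dual function and its maximum are

$$g(\lambda) = \min_x \left\{ x^2 + \lambda(1 - x) \right\} = \lambda - \frac{\lambda^2}{4}, \qquad \max_{\lambda \ge 0} g(\lambda) = g(2) = 1$$

This equals the primal optimum $f(x^*) = 1$, so the duality gap $\eta$ is zero, as strong duality predicts for this convex problem.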

Prerequisites

  • Introduction: Calculus (single and multivariable), basics of linear algebra
  • Beginner: Introduction-level content, matrix theory
  • Intermediate: Beginner-level content, basics of numerical analysis
  • Advanced: Intermediate-level content, functional analysis, combinatorics