What are the KKT conditions?

The Karush-Kuhn-Tucker conditions are the necessary first-order optimality conditions for constrained optimisation: stationarity of the Lagrangian, primal feasibility, dual feasibility, and complementary slackness. Under convexity they become sufficient as well.

Is simplex faster than the interior-point method?

It depends. Simplex has exponential worst case but is practically polynomial and excels at sparse small-to-medium LPs. Interior-point methods are polynomial in theory and shine on large dense LPs. sangi ships a two-phase revised simplex.

How do Levenberg-Marquardt and Gauss-Newton differ?

Gauss-Newton solves (JᵀJ) p = −Jᵀr and ignores the second-order Hessian term, which causes failure on rank-deficient Jacobians. Levenberg-Marquardt adds a damping term (JᵀJ + λI) p = −Jᵀr and adjusts λ trust-region-style. The sangi gaussNewtonMinimize routine uses LM damping.

Penalty methods vs augmented Lagrangian: when to use which?

Penalty methods drive the penalty coefficient μ to infinity and become ill-conditioned. Augmented Lagrangian estimates Lagrange multipliers as well, achieving constraint satisfaction at moderate μ. Augmented Lagrangian is generally more reliable.

When should I prefer FISTA over ADMM?

FISTA tackles f + g with smooth f and a proximable g and achieves O(1/k²) convergence with Nesterov acceleration. ADMM decomposes f(x) + g(z) with Ax + Bz = c into two proximal subproblems, which is ideal for composite constraints and distributed optimisation. FISTA is usually simpler and faster on Lasso; ADMM is preferred when the structure benefits from variable splitting.

Constrained Optimization, LP, and Convex Programming

Overview

Constrained optimisation solves

$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad \mathbf{g}(\mathbf{x}) \leq 0, \;\; \mathbf{h}(\mathbf{x}) = 0,$$

with $\mathbf{x} \in \mathbb{R}^n$. Algorithms differ along linearity, convexity, smoothness, and the availability of derivatives. sangi ships linear programming, nonlinear least squares, bound-constrained L-BFGS-B, penalty, augmented Lagrangian, and proximal convex optimisation (FISTA, ADMM) in optimization_nd.hpp / linear_programming.hpp / ConvexOptimization.hpp.

See Optimization-unconstrained for unconstrained methods and Optimization-derivative-free for derivative-free approaches.

Linear programming (LP)

$$\min \mathbf{c}^T \mathbf{x} \quad \text{subject to} \quad A \mathbf{x} \;\{\leq, =, \geq\}\; \mathbf{b}, \quad \mathbf{x} \geq 0.$$

Transportation, production planning, portfolio optimisation, and network flow are classical applications.

Simplex method

Dantzig (1947) walks along vertices of the feasible polytope. Each iteration:

Compute reduced costs from the current basis via dual variables.
Pick an incoming non-basic variable with negative reduced cost; pick an outgoing basic variable using a pivot rule (Bland, largest improvement, etc.).
Update the basis and move to the next vertex.

Worst-case complexity is exponential (Klee-Minty), but in practice the method is polynomial. The revised simplex stores an LU factorisation of the basis instead of its inverse, scaling gracefully on sparse large LPs.

Two-phase method

When no obvious basic feasible solution exists, Phase I solves an auxiliary LP with artificial variables to obtain one, and Phase II minimises the original objective. sangi simplex implements two-phase revised simplex with $\leq, =, \geq$ constraints.

Duality

The primal $\min \mathbf{c}^T \mathbf{x}$ s.t. $A \mathbf{x} \leq \mathbf{b}, \mathbf{x} \geq 0$ has dual $\max \mathbf{b}^T \mathbf{y}$ s.t. $A^T \mathbf{y} \leq \mathbf{c}, \mathbf{y} \geq 0$, and strong duality makes their optima coincide. Duals provide sensitivity analysis and the economic interpretation of shadow prices.

Interior-point and convex programming

Where simplex walks the boundary, interior-point methods cut through the interior toward the optimum.

Primal-dual path follower

Relax $\mathbf{x} \geq 0$ with a logarithmic barrier $-\mu \sum_i \log x_i$:

$$\min \mathbf{c}^T \mathbf{x} - \mu \sum_i \log x_i \quad \text{subject to} \quad A \mathbf{x} = \mathbf{b}.$$

The minimiser converges to the LP optimum as $\mu \to 0$. Newton iterations chase the central path. Karmarkar (1984) established polynomial time, and Mehrotra (1992) predictor-corrector dramatically improved practical performance. On large dense LPs interior-point dominates.

Convex programming, SOCP, SDP

Interior-point extends beyond LP:

QP: $\min \tfrac{1}{2} \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x}$, $A \mathbf{x} \leq \mathbf{b}$ (SVMs, MPC, etc.).
SOCP: $\|A_i \mathbf{x} + \mathbf{b}_i\| \leq \mathbf{c}_i^T \mathbf{x} + d_i$ (robust LP, portfolio optimisation).
SDP: matrix variables with positive semidefinite constraints (LMI control, combinatorial relaxations).

The self-concordant barrier theory (Nesterov-Nemirovski 1994) unifies all of these. The current sangi ships only LP simplex; QP, SOCP, and SDP are future work.

KKT conditions and Lagrange multipliers

For the constrained problem with $\mathbf{h}(\mathbf{x}) = 0$ and $\mathbf{g}(\mathbf{x}) \leq 0$, define the Lagrangian

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x}) + \boldsymbol{\mu}^T \mathbf{g}(\mathbf{x}).$$

If $\mathbf{x}^*$ is a local minimiser and constraint qualifications (e.g. LICQ) hold, there exist $\boldsymbol{\lambda}^*, \boldsymbol{\mu}^*$ such that the KKT conditions hold:

Condition	Statement
Stationarity	$\nabla_x \mathcal{L}(\mathbf{x}^, \boldsymbol{\lambda}^, \boldsymbol{\mu}^*) = 0$
Primal feasibility	$\mathbf{h}(\mathbf{x}^) = 0, \;\; \mathbf{g}(\mathbf{x}^) \leq 0$
Dual feasibility	$\boldsymbol{\mu}^* \geq 0$
Complementary slackness	$\mu_i^* \cdot g_i(\mathbf{x}^*) = 0 \;\; \forall i$

Under convexity ($f, g_i$ convex, $h_j$ affine) the conditions are also sufficient. Most constrained algorithms (SQP, interior-point, active-set) iteratively satisfy the KKT system.

SQP and active-set methods

SQP (Sequential Quadratic Programming)

SQP replaces the nonlinear problem with a sequence of QP sub-problems. Each iteration solves

$$\min_{\mathbf{p}} \; \nabla f(\mathbf{x}_k)^T \mathbf{p} + \tfrac{1}{2} \mathbf{p}^T H_k \mathbf{p}$$ $$\text{s.t.} \;\; \nabla h_j(\mathbf{x}_k)^T \mathbf{p} + h_j(\mathbf{x}_k) = 0, \;\; \nabla g_i(\mathbf{x}_k)^T \mathbf{p} + g_i(\mathbf{x}_k) \leq 0,$$

where $H_k$ is the Lagrangian Hessian or its BFGS approximation. The method achieves quadratic convergence and underlies SNOPT, NLPQL, and similar solvers. SQP is not yet in sangi; it is a candidate addition.

Active-set methods

Assume the inequalities currently active at the iterate, solve the resulting equality-constrained sub-problem, and update the active set based on Lagrange multiplier signs or KKT residuals. Active-set methods are classical for QP and excel on small-to-medium problems with sparse constraint structure.

Penalty and augmented Lagrangian

Penalty methods

Fold constraint violations into the objective and minimise it unconstrained:

$$\Phi_\mu(\mathbf{x}) = f(\mathbf{x}) + \mu \sum_i \max(0, g_i(\mathbf{x}))^2 + \mu \sum_j h_j(\mathbf{x})^2.$$

Constraints are satisfied in the limit $\mu \to \infty$. Simple to implement but the conditioning of the inner problem deteriorates with $\mu$.

sangi penaltyMinimize handles inequality constraints by increasing $\mu$ in stages.

Augmented Lagrangian

Add Lagrange multiplier estimates to the penalty:

$$\mathcal{L}_A(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}; \rho) = f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x}) + \tfrac{\rho}{2} \|\mathbf{h}(\mathbf{x})\|^2 + \cdots$$

The outer loop updates $\boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda} + \rho \, \mathbf{h}(\mathbf{x}_k)$. Constraints become tight without forcing $\rho \to \infty$, sidestepping the ill-conditioning of pure penalty methods. Also known as Powell-Hestenes-Rockafellar.

sangi augmentedLagrangianMinimize handles both equality and inequality constraints.

Box constraints (L-BFGS-B)

When each variable has an independent lower and upper bound $l_i \leq x_i \leq u_i$, gradient projection combined with L-BFGS gives an efficient solver. L-BFGS-B (Byrd, Lu, Nocedal, Zhu 1995) is the standard implementation.

Each iteration:

Compute the Cauchy point by projecting the gradient onto the feasible set; determine the active set.
Take an L-BFGS step on the free variables ($l_i < x_i < u_i$).
Project the resulting step into the box via line search.

Use $l_i = -\infty$ or $u_i = +\infty$ to leave a variable unbounded. sangi lbfgsbMinimize implements this scheme.

Nonlinear least squares (Gauss-Newton, LM)

For residuals $\mathbf{r}: \mathbb{R}^n \to \mathbb{R}^m$ ($m \geq n$), minimise $\tfrac{1}{2} \|\mathbf{r}(\mathbf{x})\|^2$. Curve fitting, robotics, and bundle adjustment all reduce to NLS.

Gauss-Newton

Linearise the residuals as $\mathbf{r}(\mathbf{x} + \mathbf{p}) \approx \mathbf{r}(\mathbf{x}) + J \mathbf{p}$ and solve the normal equations:

$$J^T J \, \mathbf{p} = -J^T \mathbf{r}.$$

Quadratic convergence on small residuals. The omitted second-order Hessian term hurts when $J^T J$ is ill-conditioned or rank-deficient.

Levenberg-Marquardt

Add damping:

$$(J^T J + \lambda I) \mathbf{p} = -J^T \mathbf{r}.$$

Large $\lambda$ behaves like steepest descent; small $\lambda$ matches Gauss-Newton. Update $\lambda$ from the gain ratio $\rho = \frac{\Delta f_{\text{actual}}}{\Delta f_{\text{predicted}}}$ in trust-region style (Marquardt 1963).

sangi gaussNewtonMinimize uses fixed $\lambda$ with Armijo line search; leastSquaresFit implements the standard adaptive LM. curveFit provides a high-level model-fit API.

Covariance and statistics

After convergence the parameter covariance is $\mathrm{Cov}(\hat{\mathbf{p}}) = s^2 (J^T J)^{-1}$ with $s^2 = \|\mathbf{r}\|^2 / (m - n)$. Standard errors, $\chi^2$, and reduced $\chi^2$ are returned by LeastSquaresFitResult.

FISTA and ADMM

Modern machine learning and signal processing rely on non-smooth convex optimisation (L1 regularisation, total variation, etc.). Two algorithms dominate this regime.

FISTA

Beck and Teboulle (2009) handle the composite problem

$$\min_{\mathbf{x}} \; f(\mathbf{x}) + g(\mathbf{x}),$$

where $f$ is smooth convex with $L$-Lipschitz gradient and $g$ is a convex function with cheap proximal operator $\mathrm{prox}_{\alpha g}(v) = \arg\min_u \{g(u) + \frac{1}{2\alpha} \|u - v\|^2\}$.

Iteration:

$$\mathbf{y}_k = \mathbf{x}_k + \frac{t_{k-1} - 1}{t_k} (\mathbf{x}_k - \mathbf{x}_{k-1})$$ $$\mathbf{x}_{k+1} = \mathrm{prox}_{(1/L) g}\bigl(\mathbf{y}_k - \tfrac{1}{L} \nabla f(\mathbf{y}_k)\bigr)$$ $$t_{k+1} = \tfrac{1 + \sqrt{1 + 4 t_k^2}}{2}.$$

Nesterov-style momentum gives $O(1/k^2)$ convergence. Lasso ($g = \lambda \|\cdot\|_1$, prox = soft thresholding), TV restoration, and low-rank matrix recovery all run through FISTA.

sangi fista is the general routine, and lassoFISTA is the Lasso specialisation (with $L$ estimated by power iteration).

ADMM

Going back to Glowinski-Marrocco (1975) and Gabay-Mercier (1976), ADMM solves

$$\min \; f(\mathbf{x}) + g(\mathbf{z}) \quad \text{subject to} \quad A \mathbf{x} + B \mathbf{z} = \mathbf{c}$$

by alternating proximal minimisations of an augmented Lagrangian and updating a dual variable:

$$\mathbf{x}_{k+1} = \arg\min_{\mathbf{x}} \; f(\mathbf{x}) + \tfrac{\rho}{2} \|A \mathbf{x} + B \mathbf{z}_k - \mathbf{c} + \mathbf{u}_k\|^2$$ $$\mathbf{z}_{k+1} = \arg\min_{\mathbf{z}} \; g(\mathbf{z}) + \tfrac{\rho}{2} \|A \mathbf{x}_{k+1} + B \mathbf{z} - \mathbf{c} + \mathbf{u}_k\|^2$$ $$\mathbf{u}_{k+1} = \mathbf{u}_k + A \mathbf{x}_{k+1} + B \mathbf{z}_{k+1} - \mathbf{c}.$$

Variable splitting decomposes complex problems into two proximal subproblems, exploiting any structure in $f$ and $g$. ADMM is the workhorse of distributed and consensus optimisation. sangi admmLasso is the Lasso specialisation.

For a thorough survey see Boyd et al. (2011), Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

Dispatch in sangi

Problem shape	Recommended function (sangi)
Linear objective and constraints (LP)	`simplex` / `simplexMaximize`
Box constraints $l \leq x \leq u$ only	`lbfgsbMinimize`
Inequalities only, gradient available	`penaltyMinimize`
Mixed equality + inequality, gradient	`augmentedLagrangianMinimize`
Nonlinear least squares $\sum r_i(x)^2$	`gaussNewtonMinimize` / `leastSquaresFit`
Curve fitting $y = f(x, p)$	`curveFit`
Lasso $\frac{1}{2}\\|Ax - b\\|^2 + \lambda \\|x\\|_1$	`lassoFISTA` / `admmLasso`
General composite convex $f + g$ (prox-able $g$)	`fista`
Constrained, no gradient	`cmaesMinimize` with manual penalty (derivative-free)

Scaling matters

Constrained optimisation is sensitive to variable and constraint scaling. Normalising variables to roughly $[-1, 1]$ and matching constraint magnitudes drastically improves conditioning. In LP, row/column equilibration is standard; in NLS, Jacobian column scaling pays off.

References

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization, 2nd ed. Springer.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Bertsekas, D. P. (1999). Nonlinear Programming, 2nd ed. Athena Scientific.
Marquardt, D. W. (1963). "An Algorithm for Least-Squares Estimation of Nonlinear Parameters". SIAM J. Appl. Math., 11, 431–441.
Beck, A. and Teboulle, M. (2009). "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems". SIAM J. Imaging Sci., 2, 183–202.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers". Foundations and Trends in Machine Learning, 3, 1–122.