Strassen's Algorithm

Fast Matrix Multiplication via Divide-and-Conquer with $O(n^{\log_2 7})$ Complexity

Advanced

1. Introduction

The naive product $C = AB$ of two $n \times n$ matrices computes each entry $c_{ij} = \displaystyle\sum_{k=1}^n a_{ik} b_{kj}$ with $n$ multiplications and $n-1$ additions, giving $O(n^3)$ in total.

In 1969, Volker Strassen showed that the product of two $2 \times 2$ matrices can be computed with only seven multiplications instead of eight, and that recursive application reduces the overall complexity to

$$O(n^{\log_2 7}) = O(n^{2.807\ldots}).$$

This was the first proof that $O(n^3)$ is not necessary for matrix multiplication, and it launched the field of fast matrix multiplication.

2. Complexity of naive matrix multiplication

Consider the $2 \times 2$ matrix product:

$$\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$

By the standard definition

$$c_{ij} = \displaystyle\sum_{k=1}^{2} a_{ik} b_{kj},$$

this requires 8 multiplications and 4 additions. Splitting an $n \times n$ matrix into $n/2 \times n/2$ blocks and applying this recursively gives the recurrence

$$T(n) = 8\, T(n/2) + O(n^2),$$

and by the Master theorem (the theorem that gives the asymptotic solution of divide-and-conquer recurrences $T(n) = a\,T(n/b) + f(n)$ by comparing $f(n)$ with $n^{\log_b a}$) we obtain $T(n) = O(n^{\log_2 8}) = O(n^3)$.

3. Strassen's idea — seven multiplications

Strassen introduced the following seven auxiliary products $M_1, \ldots, M_7$:

Strassen's seven products

$$\begin{aligned} M_1 &= (a_{11} + a_{22})(b_{11} + b_{22}) \\ M_2 &= (a_{21} + a_{22})\, b_{11} \\ M_3 &= a_{11}\, (b_{12} - b_{22}) \\ M_4 &= a_{22}\, (b_{21} - b_{11}) \\ M_5 &= (a_{11} + a_{12})\, b_{22} \\ M_6 &= (a_{21} - a_{11})(b_{11} + b_{12}) \\ M_7 &= (a_{12} - a_{22})(b_{21} + b_{22}) \end{aligned}$$

From these seven $M_i$, every entry of $C$ can be recovered using only additions and subtractions:

Recovery of the entries of $C$

$$\begin{aligned} c_{11} &= M_1 + M_4 - M_5 + M_7 \\ c_{12} &= M_3 + M_5 \\ c_{21} &= M_2 + M_4 \\ c_{22} &= M_1 - M_2 + M_3 + M_6 \end{aligned}$$

3.1 Verification

For example, expanding $c_{11}$:

$$\begin{aligned} c_{11} &= M_1 + M_4 - M_5 + M_7 \\ &= (a_{11}b_{11} + a_{11}b_{22} + a_{22}b_{11} + a_{22}b_{22}) \\ &\quad + (a_{22}b_{21} - a_{22}b_{11}) \\ &\quad - (a_{11}b_{22} + a_{12}b_{22}) \\ &\quad + (a_{12}b_{21} + a_{12}b_{22} - a_{22}b_{21} - a_{22}b_{22}) \\ &= a_{11}b_{11} + a_{12}b_{21} \end{aligned}$$

which is the correct result. The other entries can be verified analogously.

3.2 Complexity

Splitting an $n \times n$ matrix into $n/2 \times n/2$ blocks and applying Strassen recursively to compute the $M_i$, the multiplication count drops from 8 to 7 and the recurrence becomes

$$T(n) = 7\, T(n/2) + O(n^2).$$

By the Master theorem,

$$T(n) = O(n^{\log_2 7}) = O(n^{2.807\ldots}).$$

The number of additions/subtractions grows from 4 to 18, but additions cost only $O(n^2)$, so for large $n$ the saving in multiplications dominates.

3.3 Worked example: Strassen on $2 \times 2$ matrices

Take $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$.

Step 1. Compute the seven auxiliary products.

$$\begin{aligned} M_1 &= (a_{11}+a_{22})(b_{11}+b_{22}) = (1+4)(5+8) = 5 \cdot 13 = 65 \\ M_2 &= (a_{21}+a_{22})\, b_{11} = (3+4) \cdot 5 = 35 \\ M_3 &= a_{11}(b_{12}-b_{22}) = 1 \cdot (6-8) = -2 \\ M_4 &= a_{22}(b_{21}-b_{11}) = 4 \cdot (7-5) = 8 \\ M_5 &= (a_{11}+a_{12})\, b_{22} = (1+2) \cdot 8 = 24 \\ M_6 &= (a_{21}-a_{11})(b_{11}+b_{12}) = (3-1)(5+6) = 22 \\ M_7 &= (a_{12}-a_{22})(b_{21}+b_{22}) = (2-4)(7+8) = -30 \end{aligned}$$

Step 2. Recover the result.

$$\begin{aligned} c_{11} &= M_1 + M_4 - M_5 + M_7 = 65 + 8 - 24 - 30 = 19 \\ c_{12} &= M_3 + M_5 = -2 + 24 = 22 \\ c_{21} &= M_2 + M_4 = 35 + 8 = 43 \\ c_{22} &= M_1 - M_2 + M_3 + M_6 = 65 - 35 - 2 + 22 = 50 \end{aligned}$$

Check: $AB = \begin{pmatrix} 1\cdot5+2\cdot7 & 1\cdot6+2\cdot8 \\ 3\cdot5+4\cdot7 & 3\cdot6+4\cdot8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$ ✓

4. Recursive structure (diagram)

Figure 1: Recursive structure of Strassen's algorithm An n x n matrix is split into four n/2 x n/2 blocks and seven recursive multiplications are performed A₁₁ A₁₂ A₂₁ A₂₂ A (n×n) × B₁₁ B₁₂ B₂₁ B₂₂ B (n×n) 7 recursive multiplications M₁ = (A₁₁+A₂₂)(B₁₁+B₂₂) M₂ = (A₂₁+A₂₂) B₁₁ M₃ = A₁₁ (B₁₂−B₂₂) M₄ = A₂₂ (B₂₁−B₁₁) M₅ = (A₁₁+A₁₂) B₂₂ M₆ = (A₂₁−A₁₁)(B₁₁+B₁₂) M₇ = (A₁₂−A₂₂)(B₂₁+B₂₂) 18 additions/subtractions to recover C C₁₁ = M₁+M₄−M₅+M₇ C₁₂ = M₃+M₅ C₂₁ = M₂+M₄ Naive: 8 mults → O(n³)  Strassen: 7 mults → O(n^2.807)
Figure 1: Recursive structure of Strassen's algorithm. An $n \times n$ matrix is split into four blocks and computed via seven $n/2 \times n/2$ matrix multiplications.

5. Practical considerations

5.1 Crossover (threshold)

Because Strassen has 18 additions/subtractions, the overhead dominates for small matrices. In practice, when the matrix size at the bottom of the recursion drops below a threshold $n_0$, the algorithm switches to the naive product.

Typically $n_0$ is in the tens to hundreds, depending on hardware (cache size, SIMD width) and memory access patterns.

5.2 Numerical stability

Strassen uses many additions/subtractions, so cancellation errors accumulate. In floating-point arithmetic, the forward error of the naive product is bounded by

$$|c_{ij} - \hat{c}_{ij}| \leq n\, u\, \|A\|_\infty \|B\|_\infty + O(u^2),$$

whereas Strassen satisfies a weaker bound

$$|c_{ij} - \hat{c}_{ij}| \leq O(n^{\log_2 12})\, u\, \|A\|_\infty \|B\|_\infty,$$

(where $u$ is the unit round-off). For high-precision applications (matrix inversion, eigenvalue decompositions, etc.) this requires care. Multi-precision integer multiplication (no rounding error) does not have this stability issue.

5.3 Odd sizes

If $n$ is odd, the matrix cannot be split evenly. The standard fix is padding (add one row and one column to make the dimension even); the padded entries are removed at the end.

Example: For $n=3$, zero-pad the $3 \times 3$ matrix to $4 \times 4$.

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \;\longrightarrow\; A' = \begin{pmatrix} a_{11} & a_{12} & a_{13} & 0 \\ a_{21} & a_{22} & a_{23} & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

Run Strassen on $C' = A' B'$ and extract the top-left $3 \times 3$ block to recover $AB$. When recursing, repad whenever the current dimension becomes odd.

5.4 Memory usage

Each level of recursion needs temporary storage for the $M_i$, increasing memory consumption over the naive product. The depth of the recursion is $O(\log n)$ and each level needs $O(n^2)$ temporary memory, so the additional memory is $O(n^2 \log n)$ in total.

5.5 Cache efficiency

The recursive partitioning of Strassen breaks the cache locality that the naive triple loop (ijk or ikj order) achieves so naturally. In particular, the creation and destruction of temporaries for each $M_i$ multiplies cache misses, and memory bandwidth often becomes the bottleneck.

For this reason, high-performance BLAS libraries such as BLIS, Intel MKL, and OpenBLAS do not adopt Strassen; instead they keep the asymptotic $O(n^3)$ complexity but extract peak performance with cache blocking (hierarchical block-by-block organisation) combined with SIMD. Strassen wins on multiplication count, but it is not always faster in wall-clock time even for $n$ in the few thousands because it is not cache-friendly.

5.6 Implementation tips

  • Reuse workspace: instead of allocating new matrices at each recursion level, pre-allocate the total memory once and reuse the same buffers across the seven auxiliary products. This eliminates allocator pressure and GC overhead.
  • Avoid copies (strided views): represent sub-matrices as views with a leading dimension (a "stride") rather than copying the data. This matches the BLAS lda convention and lets sub-block operations proceed in place.
  • SIMD-accelerated naive base case: switch off Strassen below a recursion threshold $n_0$ (typically 64–256) and fall through to a SIMD-vectorised naive block multiply (for example with $4 \times 4$ register blocks). This base case dominates the runtime in practice.
  • Parallelism: the seven auxiliary products $M_i$ are independent, so a thread/task pool can scale up to seven-way parallelism on each recursion level with very little coordination.

6. Winograd variant

S. Winograd showed a variant that uses the same 7 multiplications as Strassen but reduces additions/subtractions from 18 to 15. The asymptotic complexity is the same $O(n^{2.807})$, but the constant factor is improved, so it is somewhat faster in practice.

Most modern high-performance linear algebra libraries that include a Strassen-style algorithm (in the internals of BLAS Level 3, etc.) use the Winograd variant.

7. History of the matrix multiplication exponent $\omega$

Writing the number of operations needed for $n \times n$ matrix multiplication as $O(n^\omega)$, $\omega$ is called the matrix multiplication exponent. The theoretical lower bound is $\omega \geq 2$ (since the output has $n^2$ entries).

Year Authors Exponent $\omega$
Naive method$3.000$
1969Strassen$2.807$
1978Pan$2.796$
1981Schönhage$2.548$
1986Strassen$2.479$
1990Coppersmith–Winograd$2.376$
2012Williams$2.3727$
2024Duan–Wu–Zhou$2.371339$

Whether $\omega = 2$ is achievable is one of the major open problems in computational complexity theory.

7.1 Beyond Strassen and "galactic algorithms"

Post-Strassen improvements rely on more elaborate tensor-decomposition constructions and the laser method. From Coppersmith–Winograd onward, these algorithms have driven the exponent down to $\omega \approx 2.37$, but the hidden constants and lower-order terms are astronomically large. For realistic matrix sizes (say $n \le 10^{12}$), these algorithms lose not only to Strassen but to naive multiplication as well.

Such algorithms—asymptotically fast but practically slower or outright unimplementable—are known as galactic algorithms. In this taxonomy Strassen is one of the few fast matrix-multiplication algorithms that is not galactic, which is why it still appears in production BLAS and multiprecision multiplication today.

8. Summary

  • Strassen showed how to compute the product of two $2 \times 2$ matrices using only 7 multiplications.
  • Applied recursively, the product of $n \times n$ matrices runs in $O(n^{2.807})$.
  • In practice, a hybrid strategy that switches to the naive product below a threshold size is used.
  • Numerical stability is somewhat worse, but is not an issue for multi-precision integer arithmetic.
  • The Winograd variant reduces additions/subtractions from 18 to 15.
  • Minimizing the matrix multiplication exponent $\omega$ is an active area of research in computational complexity.

References

  • V. Strassen, "Gaussian elimination is not optimal," Numerische Mathematik, vol. 13, pp. 354–356, 1969. — DOI: 10.1007/BF02165411
  • T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, 4th ed., MIT Press, 2022. — Chapter 4.2: Strassen's algorithm for matrix multiplication.
  • N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002. — Chapter 23: Fast matrix multiplication.
  • Strassen algorithm — Wikipedia
  • Matrix multiplication algorithm — Wikipedia

Related articles