The Matrix Multiply Flip Graph

Two months of matrix multiply, of regret, loss, spring, sunburn, sore muscles, and a return to my bike. This post resumes the search for ways to multiply matrices with few multiplications. Some nihilistic hedonism propels me to continue this, but lately I reflect on how much I miss while absorbed by these things. I describe an interesting object called the matrix multiply flip graph, in which many world-record matrix multiply schemes have been found. We'll prove something new about its connectivity, then consider how to efficiently explore it using a GPU. The GPU-accelerated search procedure is quite fast, recovering the existing world-record scheme for $4\times 4$ matrix multiply in around thirty seconds on an RTX Pro 6000, compared to many GPU-weeks in the original publication. As of yet though, I have no new world-record schemes to report, hence this being a blog post.

The vertices of the matrix multiply flip graph are ways to multiply matrices. Some ways require fewer multiplications than others; because multiplication is expensive in hardware, considerable effort has been spent exploring the flip graph to find them.

Concretely, the vertices of the matrix multiply flip graph are rank- $R$ decompositions of the matrix multiply tensor.

Definition 1. The matrix multiply tensor for multiplying two $n$ by $n$ matrices (note 1) is the unique $M^{(n)}$ satisfying, for any $n$ by $n$ matrices $D,E$ ,

\tag{1} (DE)_{k,l} = \sum_{g,h,i,j=1}^{n}M^{(n)}_{g,h,i,j,k,l}D_{g,h}E_{i,j}.

That is, $M^{(n)}$ has shape $(n,n,n,n,n,n)$ and picks what products of elements of $D$ and $E$ sum to each element of $DE$ .

$M^{(n)}$ is said to have a rank- $R$ decomposition if

\tag{2} M^{(n)}=\sum_{r=1}^R A^{(r)}\otimes B^{(r)}\otimes C^{(r)},

for matrices $A^{(i)},B^{(i)},C^{(i)}$ with shape $(n,n)$ , where $(A\otimes B)_{i,j,k,l}=A_{i,j}B_{k,l}$ denotes the outer product. It is not too difficult to see that $DE$ can be computed with $R$ multiplications iff $M^{(n)}$ has a rank- $R$ decomposition, as after substituting (2) into (1) one arrives at

\begin{aligned} &(DE)_{k,l} \\ &= \sum_{g,h,i,j} (\sum_rA^{(r)}\otimes B^{(r)}\otimes C^{(r)})_{g,h,i,j,k,l}D_{g,h}E_{i,j} \\ &= \sum_{g,h,i,j} \sum_rA^{(r)}_{g,h}D_{g,h} B^{(r)}_{i,j}E_{i,j} C^{(r)}_{k,l} \\ &= \sum_rC^{(r)}_{k,l}\sum_{g,h,i,j}A^{(r)}_{g,h}D_{g,h} B^{(r)}_{i,j}E_{i,j} \\ &= \sum_rC^{(r)}_{k,l}\underbrace{(\sum_{g,h}A^{(r)}_{g,h}D_{g,h})(\sum_{i,j}B^{(r)}_{i,j}E_{i,j})}_{m_r}. \end{aligned}

As the $m_r$ terms do not depend on the values of $k,l$ , they can be precomputed. The $R$ $m_r$ terms are the only multiplications involving elements of $D$ and $E$ required for computing $(DE)_{k,l}$ , so $DE$ is said to computable with $R$ multiplies iff $M^{(n)}$ has a rank- $R$ decomposition.

It is natural then to say,

Definition 2. The vertices of the matrix multiply flip graph for multiplication of $n$ by $n$ matrices are all multisets

S=\{A^{(1)}\otimes B^{(1)}\otimes C^{(1)},\dots,A^{(|S|)}\otimes B^{(|S|)}\otimes C^{(|S|)}\}

satisfying $\sum_{s\in S}s = M^{(n)}$ and having no elements that evaluate to zero.

It would be convenient if $S$ was connected to $S'$ iff $|S'|<|S|$ , but it is, as of yet, not known how to compute the out-neighborhood of a vertex in this graph. Instead, work thus far uses some subset of easily-computable flip, plus, and reduction edges. An interesting property of the edge-choice is if the resulting graph is connected, i.e. from any vertex every other vertex is reachable, as in a such a graph all unknown ways to multiply matricies are reachable from all known ones. In their work introducing the flip graph, Kauers and Moosbauer [1] showed with flip and reduction edges the resulting graph is weakly connected; after treating reduction edges as undirected, the graph is connected. Then, Arai, Yuma, and Hukushima [2] showed that full connectivity was achievable with a third edge, plus.

Despite connectivity proofs being for arbitrary fields, to our knowledge, all existing implementations search for schemes over $\mathbb{F}_2$ , as over this field extremely fast search implementations are possible. Hence, this note investigates what edges are required for connectivity over $\mathbb{F}_2$ . It is shown that over $\mathbb{F}_2$ flip and plus suffice.

Definition 3. If $S$ can be transformed into $S'$ via a flip or plus transform, then there is a flip/plus edge from $S$ to $S'$ in the matrix multiply flip graph.

A flip on the $A$ position transforms $S$ into $S'$ if there exists $A\otimes B\otimes C,A'\otimes B'\otimes C'\in S$ such that $A=A'$ and

\begin{aligned} S'=S&\setminus\{A\otimes B\otimes C,A'\otimes B'\otimes C'\}\\&\cup \{A\otimes B\otimes (C+C'),A'\otimes (B'-B)\otimes C'\}. \end{aligned}

A plus on the $A$ position transforms $S$ into $S'$ if, there exists $A\otimes B\otimes C,A'\otimes B'\otimes C'\in S$ ,

\begin{aligned} S'=S&\setminus\{A'\otimes B'\otimes C'\}\\&\cup\{(A'-A)\otimes B'\otimes C', A\otimes B'\otimes C'\}, \end{aligned}

Flip and plus transforms are defined likewise to operate on the $B$ and $C$ positions of summands. If a flip transform results in a zero summand, that summand is dropped.

Theorem 4. Over $\mathbb{F}_2$ , the matrix multiply flip graph with flip and plus edges is connected.

Connectivity Proof

Lemma 5. *The matrix multiply tensor for multiplying two $n$ by $n$ matrices is,

M^{(n)}=\sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n E_{ij} \otimes E_{jk} \otimes E_{ik}

where $E_{ij}$ is a $n$ by $n$ matrix with a $1$ at position $ij$ and zeros everywhere else.*

Proof. As a consequence of (1), $M^{(n)}_{ghijkl}=\delta_{g,k}\delta_{h,i}\delta_{j,l}$ where $\delta_{g,h}$ denotes the Kronecker delta, so inspecting the basis-decomposition of $M^{(n)}$ we have,

\begin{aligned} M^{(n)} &= \sum_{i,j,j',k,i',k'}\delta_{i,i'}\delta_{k,k'}\delta_{j,j'}E_{i,j}\otimes E_{j',k} \otimes E_{i',k'} \\ &= \sum_{i,k,j}E_{i,j}\otimes E_{j,k}\otimes E_{i,k}. \end{aligned}

Lemma 6. *If $S=\{A^{(1)}\otimes B^{(1)}\otimes C^{(1)},\dots\}$ is a decomposition of $M^{(n)}$ , then

\begin{aligned} &\text{span}(A^{(1)},\dots,A^{(|S|)})\\&=\text{span}(B^{(1)},\dots,B^{(|S|)})\\&=\text{span}(C^{(1)},\dots,C^{(|S|)})=\mathbb{F}^{n\times n}. \end{aligned}

Proof. Write each $B^{(r)}$ and $C^{(r)}$ in the standard basis

B^{(r)}=\sum_{j,k'}^{m,p}\beta^{(r)}_{j,k'}E_{j,k'}, \quad C^{(r)}=\sum_{i,k}^{n,p}\gamma^{(r)}_{i,k}E_{i,k}.

By the definition of $S$ and the basis-form of $M_{(n)}$ , we have

\begin{aligned} &\sum_{i,j,k}^{n,m,p}E_{i,j}\otimes E_{j,k} \otimes E_{i,k}\\ &= \sum_r^RA^{(r)}\otimes B^{(r)}\otimes C^{(r)} \\ &= \sum_{r}^RA^{(r)}\otimes(\sum_{j,k'}^{m,p}\beta^{(r)}_{j,k'}E_{j,k'})\otimes (\sum_{i,k}^{n,p}\gamma^{(r)}_{i,k}E_{i,k}) \\ &= \sum_{r}^R\sum_{j,k',i,k}^{m,p,n,p}(A^{(r)}\beta^{(r)}_{j,k'}\gamma^{(r)}_{i,k})\otimes E_{j,k'} \otimes E_{i,k} \\ &= \sum_{j,k',i,k}^{m,p,n,p}(\sum_{r}^{R}A^{(r)}\beta^{(r)}_{j,k'}\gamma^{(r)}_{i,k})\otimes E_{j,k'} \otimes E_{i,k}. \end{aligned}

Comparing the coefficient of the basis matrix $E_{j,k'}\otimes E_{i,k}$ on the left and right hand side of the equality, for any $i,j$ when $k=k'$ we have $E_{i,j}=\sum_{r}^{R}A^{(r)}\beta^{(r)}_{j,k}\gamma^{(r)}_{i,k}$ , so every $E_{i,j}$ is a linear combination of $A^{(r)}$ 's. A symmetric argument applies to the $C$ and $B$ terms.

Lemma 7. *Let $S$ be a decomposition of $M^{(n)}$ over $\mathbb{F}_2$ and $A\otimes B\otimes C$ , $A'\otimes B'\otimes C'$ be elements of $S$ . There is a path from $S$ to

S_{A+} = S \cup \{ (A+A')\otimes B\otimes C,\; (A+A')\otimes B\otimes C \},

in the matrix multiply flip graph; likewise for $S_{B+}$ and $S_{C+}$ .*

Figure 1. Derivation of Lemma 7's $S_{A+}$ when $B\neq B'$ .

Proof. When $B\neq B'$ , Figure 1 shows a derivation of $S_{A+}$ .

If $B+B'=0$ , a slightly different route is needed. After the first $P_A$ step we have

A'\otimes B\otimes C',\quad X\otimes B\otimes C,\quad A'\otimes B\otimes C,

where $X=A+A'$ . Applying $P_A$ to the last two summands, using $X\otimes B\otimes C$ as the pivot, gives

A'\otimes B\otimes C',\quad X\otimes B\otimes C,\quad A\otimes B\otimes C,\quad X\otimes B\otimes C.

as desired. The $S_{B+}$ and $S_{C+}$ cases are symmetric.

Lemma 8. *Let $S$ be a decomposition of $M^{(n)}$ over $\mathbb{F}_2$ . For arbitrary, nonzero $X,Y,Z$ , there is a path from $S$ to

S \cup \{ X\otimes Y\otimes Z,\; X\otimes Y\otimes Z \},

in the matrix multiply flip graph.*

Proof. Fix a summand $A\otimes B\otimes C \in S.$ By Lemma 6, the $A$ -components of the summands of $S$ span $\mathbb{F}_2^{n\times n}$ , so there are terms $A^{(1)},\dots,A^{(m)}$ satisfying

X = A + A^{(1)}+\cdots+A^{(m)}.

Thus, Lemma 7 gives a procedure to add two copies of

(A+A^{(1)})\otimes B\otimes C,

then

(A+A^{(1)}+A^{(2)})\otimes B\otimes C,

and so on, until two copies of $X\otimes B\otimes C$ are generated, at which point flips eliminate intermediate terms yielding $S\cup\{X\otimes B\otimes C,X\otimes B\otimes C\}$ . This is then repeated this on the $B$ and $C$ positions.

Lemma 9. Let $S$ be a decomposition of $M^{(n)}$ over $\mathbb{F}_2$ . If $R\subseteq S$ sums to zero, i.e. $\sum_{r\in R}r=0$ , then there is a path from $S$ to $S\setminus R$ in the matrix multiply flip graph.

Proof. Let $E_{i,j}$ denote the $n$ by $n$ basis matrix with a $1$ at position $i,j$ and zeros everywhere else. First, decompose every element of $R$ into basis matrices as follows: If the $A$ term of $A\otimes B\otimes C\in R$ is not already a basis matrix, then $A=E_{i_1,j_1}+\dots+E_{i_t,j_t}$ , so add $E_{i_1,j_1}\otimes B\otimes C$ twice via Lemma 8

A\otimes B\otimes C, \quad E_{i_1,j_1}\otimes B\otimes C, \quad E_{i_1,j_1}\otimes B\otimes C

then do a flip on the $B$ position of the first and second terms above for

(A-E_{i_1,j_1})\otimes B\otimes C, \quad E_{i_1,j_1}\otimes B\otimes C.

Repeating this for each element of $A$ 's decomposition and the $B$ and $C$ position decomposes $A\otimes B\otimes C\in R$ into its basis matrices.

After decomposing $R$ the resulting scheme is $(S\setminus R)\cup R'$ , where $R'$ is a multiset of basis-matrix outer products. Notice that the basis-matrix outer products are linearly independent, as $E_{i,j}\otimes E_{k,l}\otimes E_{m,n}$ is itself a basis tensor with a one at position $(i,j,k,l,m,n)$ and zeros everywhere else. Thus, for $\sum_{r\in R'}r$ to be zero, every basis-matrix outer product in $R'$ must appear an even number of times (i.e. its coefficient must be $0$ over $\mathbb{F}_2$ ). Hence, as over $\mathbb{F}_2$ a flip removes both inputs if they are the same, applying flips to identical elements removes $R'$ , giving $S\setminus R$ , as desired.

Proof of Theorem 4. To travel from any scheme $S$ to any scheme $S'$ , first transition to $S\cup S' \cup S'$ by adding every element of $S'$ to $S$ twice via Lemma 8. Then, as $\sum_{s\in S}s=\sum_{s\in S'}s=M^{(n)}$ , $\sum_{s\in S\cup S'}s=0$ , so $R=S\cup S'$ is a subset of $S\cup S'\cup S'$ that sums to zero and can be removed per Lemma 9, leaving $S'$ , as desired.

GPU-Accelerated Search

We now, somewhat unceremoniously, turn our focus towards how to efficently search the flip graph. As of yet, no better way to do this than random walks is known.

The most performance-sensitive part of random walks on the flip graph is identifying flip opportunities, i.e. summands that match in the $A$ , $B$ , or $C$ position.

\overbrace{A\hspace{0.4em}\otimes \underbrace{B}_{\mathclap{\text{Term at $B$ position}}}\otimes\hspace{0.4em} C}^{\text{Summand}}

Over $\mathbb{F}_2$ an $8\times 8$ term fits in a single $64$ -bit word, so existing flip graph search procedures for small matrix sizes maintain, for each position $A,B,C$ , a map that stores the indices of summands with a particular value at that value. For example, if summands at indices $1,2,3$ have the value $x$ for their $A$ a term and $m_A$ is the map for $A$ terms, $m_A(x)=[1,2,3]$ . For very small matrix sizes, $4\times 4$ or less, this map can be implemented with an array by interpreting term's bit-level representation as an index and larger sizes require hashing. However, this approach is not GPU friendly as it is branchy, requires operating on sizable buffers in shared memory, and, generally, is unclear how to adapt to the single-instruction-multiple-threads model used by GPUs (note 2).

Fortunately, recent CUDA GPUs have a primitive, __match_any_sync, that returns a bitmask of threads within a warp that have the same value for a given variable.

\begin{aligned} &\texttt{\_\_match\_any\_sync(}x\texttt{)}_i=1 \\&\quad\leftrightarrow \text{Thread $i$ called }\texttt{\_\_match\_any\_sync}\text{ on $x$} \end{aligned}

When there are fewer summands than threads in a warp this can be used as follows: Each thread stores a single summand. At each step all threads agree on a term position ( $A$ , $B$ , or $C$ ) and execute __match_any_sync on their summand's value at that position. Ones in the resulting bitmasks correspond to flip opportunities. This means flips are identified with a hardware primitive instead of a hash table.

In practice, available GPUs have $32$ threads per warp. As world-record decompositions of $M^{(4)}$ and $M^{(5)}$ involve $47$ and $93$ summands respectively, multiple summands must be stored per thread. Flip search then becomes probabilistic: At each step warps agree on a term position then, for $k$ rounds, each thread picks a random summand and calls __match_any_sync. Increasing $k$ increases the likelihood that available flip opportunities are found.

Generalized Flips

With __match_any_sync it is zero-cost to identify multiple summands that share a term, as this corresponds to more than two bits being set in the returned bitmask. Flip semantics can be generalized to operate on multiple summands and used to look ahead for rank-reductions.

Lets sketch a flip opportunity as matrix multiplication over subterms with $\otimes$ as multiply.

\begin{aligned} &A\otimes (B^{(1)}\otimes C^{(1)} + B^{(2)}\otimes C^{(2)}) \\&= A\otimes \bigl(\underbrace{\begin{bmatrix} B^{(1)} & B^{(2)} \end{bmatrix}}_{B}\underbrace{\begin{bmatrix} C^{(1)} \\ C^{(2)} \end{bmatrix}}_{C}\bigr) \end{aligned}

From this perspective, applying a flip has the same effect as transforming $AB$ , above, into $AGG^{-1}B$ for a particular $G$ .

\begin{aligned} &A\otimes \bigl(\begin{bmatrix} B^{(1)} & B^{(2)} \end{bmatrix} \overbrace{\begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix}}^{G}\overbrace{\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}}^{G^{-1}}\begin{bmatrix} C^{(1)} \\ C^{(2)} \end{bmatrix}\bigr) \\ &= A\otimes(B^{(1)}-B^{(2)})\otimes C^{(1)}+A\otimes B^{(2)}(C^{(1)}+C^{(2)}) \end{aligned}

This is easily generalized to multiple summands with a shared term and arbitrary, invertible $G$ .

\begin{aligned} &A\otimes (\sum_{i=1}^N B^{(i)}\otimes C^{(i)}) \\&= A\otimes \bigl(\begin{bmatrix} B^{(1)} & \dots & B^{(N)} \end{bmatrix}GG^{-1}\begin{bmatrix} C^{(1)} \\ \dots \\ C^{(N)} \end{bmatrix}\bigr) \end{aligned}

A consequence of this framing is, if the $B^{(i)}$ or $C^{(i)}$ terms are not lineally independent, then appropriate choice of $G$ results in a zero in the $B$ or $C$ vectors, and a zero corresponds to a rank-reduction. Thus appropriate choice of $G$ can skip ahead in the search space to find rank-reductions that may require many pairwise flip and plus transforms to set up the needed linear combination.

It remains to discuss how to efficiently determine if $B^{(1)},\dots,B^{(N)}$ are linealy dependent on GPU. First, it seems to be the case that sets of summands sharing a term are overwhelmingly of size $\leq 5$ on the $5\times 5$ flip graph: in $3,273$ schemes sampled from random-walks there were $43,386$ sets of more than two summands sharing a term and $99.93\%$ had size $\leq 5$ ; Figure 2 visualizes this. Over $\mathbb{F}_2$ determining if $B^{(1)},\dots,B^{(5)}$ is lineally dependent reduces to determining if the sum (i.e. XOR, i.e. $\oplus$ ) of any subset is zero. So, $5$ is a useful number, because as sets of size $5$ have $31$ non-empty subsets and warps have $32$ threads, we can assign each subset to a thread.

Figure 2. Distribution of sizes of sets of summands sharing a term from $3,273$ schemes sampled from random-waks on the $5\times 5$ flip graph.

More precisely, let $s$ be the index of a thread in a warp and $D_s$ be the indices of the set bits of $s$ . Thread zero is idle while threads $s\neq 0$ compute

\bigoplus_{i\in D_s}B^{(i)}=0.

A warp ballot then finds the lowest-index thread for which the above is true. If one exists, call it thread $m$ and the index of its first set bit $r=\min D_m$ , threads rewrite their summands from

\sum_{i=0}^{N-1} A\otimes B^{(i)}\otimes C^{(i)} \quad\text{to}\quad \sum_{i\neq r} A\otimes B^{(i)}\otimes C'^{(i)}

where

C'^{(i)}=\begin{cases} C^{(i)}\oplus C^{(r)} \quad &i\in D_m, i\neq r \\ C^{(i)} \quad &i\not\in D_m \end{cases}

and the $r$ th summand is removed, i.e. set to zero.

To see the correctness of this,

\begin{aligned} &\sum_{i\neq r} A\otimes B^{(i)}\otimes C'^{(i)} \\ &= \left(\sum_{i\neq r} A\otimes B^{(i)}\otimes C^{(i)}\right) \oplus \sum_{i\in D_m, i\neq r} A\otimes B^{(i)}\otimes C^{(r)} \\ &= \left(\sum_{i\neq r} A\otimes B^{(i)}\otimes C^{(i)}\right) \oplus A\otimes B^{(r)}\otimes C^{(r)} \\ &= \sum_{i=0}^{N-1} A\otimes B^{(i)}\otimes C^{(i)}. \end{aligned}

Notes

The exposition is on square matrix multiplication but is easily generalized.
At a the necessary level of abstraction for this section, GPUs are composed of many processors composed of threads that all execute the same instruction at the same time. For example, two threads on a processor can simultaneously compute $x+y$ , but one cannot compute $x+y$ while the other computes $x-y$ , as $+$ and $-$ are different instructions. Branchy code performs poorly on GPUs because one thread cannot execute the false branch of an if statement while the other executes the true branch. Some primitives, like __match_any_sync, allow threads in a warp to communicate with one another. If you'd like to understand GPU in more detail I made a considerable effort to explain in this post.

References

Flip Graphs for Matrix Multiplication. Manuel Kauers, Jakob Moosbauer. arXiv preprint arXiv:2212.01175. 2022.
Adaptive Flip Graph Algorithm for Matrix Multiplication. Yamato Arai, Yuma Ichikawa, Koji Hukushima. arXiv preprint arXiv:2312.16960. 2023.

Thanks to Ally, Anagha, and Quan.