Dao 2022 - Flash Attention

Paper Link

This paper argues that the attention mechanism is slow because of the reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. The authors hence create a block-wise attention algorithm that minimizes such IO reads/writes and speeds up attention significantly, especially when the sequence length is long.

Brief Overview of Attention

Suppose we have an input sequence of $N$ token embeddings $x_1, \dots, x_N$ where $x_i \in \mathbb{R}^d$, stacked into a matrix $X \in \mathbb{R}^{N \times d}$. Naively, we can compute activations by $Y = XW$, where $W \in \mathbb{R}^{d \times d}$, such that $Y \in \mathbb{R}^{N \times d}$. However, this naive way of encoding our input sequence does not allow interaction between inputs at different positions (say $x_1$ with $x_5$). We can see this by observing that the first row of $Y$ is only affected by the first row of $X$ (i.e. the first encoding $x_1$), and likewise for all the other positions.
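
To make the lack of interaction concrete, here is a small numpy check (the shapes and the perturbed position are arbitrary choices for illustration): perturbing one input position changes only the corresponding row of $Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 4                      # sequence length, embedding dim (illustrative)
X = rng.normal(size=(N, d))      # input embeddings, one row per position
W = rng.normal(size=(d, d))      # position-independent weight matrix

Y = X @ W                        # naive encoding: row i of Y depends only on row i of X

X2 = X.copy()
X2[5] += 1.0                     # perturb the embedding at position 5 only
Y2 = X2 @ W

# every row except row 5 is unchanged -> no interaction across positions
assert np.allclose(np.delete(Y, 5, axis=0), np.delete(Y2, 5, axis=0))
```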

Attention addresses this problem by adding an interaction mechanism. Besides $W_V \in \mathbb{R}^{d \times d}$ (playing the role of $W$ above), we also create weight parameters $W_Q, W_K \in \mathbb{R}^{d \times d}$. Given an input $X$, we compute the queries, keys and values as follows:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V, \qquad Q, K, V \in \mathbb{R}^{N \times d}$$

We then create an interaction matrix $S = QK^\top \in \mathbb{R}^{N \times N}$, and apply row-wise softmax to get $P = \mathrm{softmax}(S) \in \mathbb{R}^{N \times N}$. $S_{ij}$ can be thought of as a pairwise similarity between the encoding at position $i$ and the encoding at position $j$ that captures their degree of interaction. For example, in the sentence "the economy has been in decline", the value of $S_{1,5}$ (assuming 0-indexing) measuring the interaction between "economy" and "decline" might be high.

Finally, we produce the output $O = PV \in \mathbb{R}^{N \times d}$, which is an activation output of the input sequence that captures the interactions between tokens at different positions. This simple mechanism has led to significant improvements in language modelling.
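
Putting the pieces together, here is a minimal numpy sketch of the mechanism described above (a toy sketch: single head, no masking, and I omit the usual $1/\sqrt{d}$ scaling; all names and sizes are illustrative).

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with the usual max subtraction for stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # (N, d) each
    S = Q @ K.T                              # (N, N) interaction matrix
    P = softmax_rows(S)                      # (N, N) row-wise softmax
    return P @ V                             # (N, d) output O

# toy usage
rng = np.random.default_rng(0)
N, d = 8, 4
X = rng.normal(size=(N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
O = attention(X, W_Q, W_K, W_V)
print(O.shape)  # (8, 4)
```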

GPU Memory Hierarchy

The memory hierarchy is such that read/write speed is very fast on SRAM but its capacity is highly limited. Hence, the N x N attention matrix is written/read repeatedly to/from HBM, resulting in IO being a bottleneck. The numbers on an A100 GPU are as follows:

  • SRAM: 19 TB/s (20 MB RAM)
  • HBM: 1.5 TB/s (40 GB RAM)

Naive Attention Algorithm

The naive attention algorithm has many reads and writes to HBM, as the steps below show. (ps: I am not sure why we cannot simply persist the intermediate matrices in SRAM and complete the computation there, but in any case the naive algorithm requires materializing the full $N \times N$ matrices, which would quickly flood SRAM. For example, a sequence length of 2,048 at float32 already takes up about 33 MB for the $S$ and $P$ matrices (roughly 16.8 MB each), against the 20 MB of available SRAM.)

  1. Load $Q, K$ from HBM, compute $S = QK^\top$, write $S$ to HBM
  2. Read $S$ from HBM, compute $P = \mathrm{softmax}(S)$, write $P$ to HBM
  3. Load $P, V$ from HBM, compute $O = PV$, write $O$ to HBM
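
To see why the $N \times N$ intermediates dominate, here is a rough tally of the HBM bytes moved by these three steps (my own back-of-envelope, not from the paper; $N = 2048$ and $d = 64$ are illustrative, float32 throughout).

```python
N, d, itemsize = 2048, 64, 4   # sequence length, head dim, float32 bytes (illustrative)

qkv = N * d * itemsize         # one of Q, K, V, O
nxn = N * N * itemsize         # one of S, P

step1 = 2 * qkv + nxn          # read Q, K  -> write S
step2 = nxn + nxn              # read S     -> write P
step3 = nxn + 2 * qkv          # read P, V  -> write O

total = step1 + step2 + step3
print(f"{total / 1e6:.0f} MB moved, of which {4 * nxn / total:.0%} is N x N traffic")
```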

Flash Attention

The main idea is quite simple: instead of computing the full attention matrix at once, we use block-wise tiling to compute parts of it at a time. Each block is small enough to fit in SRAM, so the computation for a block can be done entirely on-chip while minimizing the IO to and from HBM, leading to faster compute and lower memory usage (the full $N \times N$ matrices are never materialized). The difficulty is in devising a block-wise softmax algorithm that yields the exact same result as computing it all at once.

Consider the naive softmax algorithm on an arbitrary vector $x \in \mathbb{R}^B$:

$$
\begin{aligned}
m(x) &:= \max_i \; x_i \\
f(x) &:= \left[ e^{x_1 - m(x)}, \; \dots, \; e^{x_B - m(x)} \right] \\
\ell(x) &:= \sum_i f(x)_i \\
\mathrm{softmax}(x) &:= \frac{f(x)}{\ell(x)}
\end{aligned}
$$

Note that the maximum value $m(x)$ is subtracted for numerical stability to avoid overflow (underflow is ok because $e^{x_i - m(x)}$ simply rounds to $0$). $f(x)$ is the numerator and $\ell(x)$ is the sum of all elements in $f(x)$.
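
A direct numpy transcription of the above, with variable names following $m$, $f$ and $\ell$ (purely illustrative):

```python
import numpy as np

def naive_softmax(x):
    m = x.max()                 # m(x): maximum, subtracted for stability
    f = np.exp(x - m)           # f(x): stabilized numerator
    l = f.sum()                 # l(x): denominator
    return f / l

x = np.array([1e3, 1.0, -1e3])  # naive np.exp(x) / np.exp(x).sum() would overflow here
print(naive_softmax(x))         # approximately [1., 0., 0.], no overflow
```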

Now, the problem with the naive softmax algorithm in the context of attention is that we need an entire row of $S$ ($N$ elements) to perform the row-wise softmax computation. This will not be available if we are performing block-wise computation, since we are splitting $K$ row-wise into blocks $K_1, \dots, K_T$ of $B$ rows each. When we compute $QK_j^\top$, only an $N \times B$ column block of $S$ is materialized in each pass, i.e. each row of $S$ only has $B$ of its $N$ entries available at a time.

Hence, we need a modified algorithm that allows us to compute chunks of the final output $O$ at a time by iterating block-wise through $K$ and $V$, such that combining the new chunk at each step with the already written intermediate $O$ gives the correct result at the end. The key to realizing this algorithm is in decomposing the softmax step, as shown below.

Consider two vectors $x^{(1)}, x^{(2)} \in \mathbb{R}^B$. We can decompose the softmax of their concatenated vector $x = \left[ x^{(1)} \;\, x^{(2)} \right] \in \mathbb{R}^{2B}$ as follows:

$$
\begin{aligned}
m(x) &= \max\!\left( m(x^{(1)}), \; m(x^{(2)}) \right) \\
f(x) &= \left[ e^{m(x^{(1)}) - m(x)} f(x^{(1)}), \;\; e^{m(x^{(2)}) - m(x)} f(x^{(2)}) \right] \\
\ell(x) &= e^{m(x^{(1)}) - m(x)} \ell(x^{(1)}) + e^{m(x^{(2)}) - m(x)} \ell(x^{(2)}) \\
\mathrm{softmax}(x) &= \frac{f(x)}{\ell(x)}
\end{aligned}
$$

The first line of the above simply notes that the maximum of $x$ is the maximum over the subvector maximums $m(x^{(1)})$ and $m(x^{(2)})$. The second line notes that we previously multiplied each element of the numerators by a factor, e.g. $e^{-m(x^{(1)})}$ for those in $x^{(1)}$. To get the correct multiplier for the full vector $x$, we need to divide away the previous multiplier and apply the new one, i.e. $e^{x_i - m(x^{(1)})} \cdot e^{m(x^{(1)}) - m(x)} = e^{x_i - m(x)}$. The third line notes that the new denominator is the sum over the subvector sums, after we apply the correct multiplier from line 2.

The decomposition is simple but powerful. It implies that so long as we keep track of the intermediate statistics $m(x)$ and $\ell(x)$, we can compute the softmax of a long vector by splitting it into subvectors and operating over one subvector at a time.
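
A quick numerical check of the decomposition: compute the statistics of two chunks separately, merge them as above, and compare against the softmax over the full vector (a sketch; variable names follow the equations).

```python
import numpy as np

rng = np.random.default_rng(0)
B = 4
x1, x2 = rng.normal(size=B), rng.normal(size=B)
x = np.concatenate([x1, x2])

def stats(v):
    """Per-chunk statistics: max m(v), numerator f(v), sum l(v)."""
    m = v.max()
    f = np.exp(v - m)
    return m, f, f.sum()

m1, f1, l1 = stats(x1)
m2, f2, l2 = stats(x2)

# combine the chunk statistics exactly as in the decomposition
m_new = max(m1, m2)
f_new = np.concatenate([np.exp(m1 - m_new) * f1, np.exp(m2 - m_new) * f2])
l_new = np.exp(m1 - m_new) * l1 + np.exp(m2 - m_new) * l2

# matches the softmax computed over the full vector in one shot
assert np.allclose(f_new / l_new, np.exp(x - x.max()) / np.exp(x - x.max()).sum())
```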

Now we are ready for Algorithm 1 (FlashAttention) of the paper.

(Algorithm 1: FlashAttention, reproduced from the paper, with the corresponding equation numbers of this post shown in parentheses on the right.)

Note that we use $\mathbf{0}_{N \times d}$ and $\mathbf{0}_N$ to denote a zero matrix of size $N \times d$ and a zero array of size $N$ respectively (the initial values of $O$ and $\ell$). For simplicity, we divide $Q$, $K$ and $V$ into equal $B$-sized blocks, but the paper allows different block sizes for $Q$ and for $K, V$. The equation numbers on the right in parentheses show which equations above the lines correspond to. Line 15 is a bit confusing because it combines multiple steps together; the next few paragraphs try to unpack it.

Firstly, note that we are using the $\circ$ operator to denote an element-wise broadcasted multiplication. For a vector $a \in \mathbb{R}^N$ and matrices $B \in \mathbb{R}^{N \times N}$, $C \in \mathbb{R}^{N \times d}$, observe the associative property $a \circ (BC) = (a \circ B)\,C$, since each element of $a$ only scales the corresponding row of the final matrix. This allows us to apply the scaling to either $BC$ or $B$ and the result will be the same.
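
A one-line numpy check of this property (shapes are illustrative): scaling the rows of $BC$ gives the same result as scaling the rows of $B$ first.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 3
a = rng.normal(size=(N, 1))          # per-row scaling factors, stored as a column so they broadcast
B = rng.normal(size=(N, N))
C = rng.normal(size=(N, d))

assert np.allclose(a * (B @ C), (a * B) @ C)   # a ∘ (BC) == (a ∘ B) C
```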

Next, see that the term $e^{\tilde{m}_{ij} - m_i^{\text{new}}} \tilde{P}_{ij} V_j$ is simply the corrected numerator of the softmax multiplied with $V_j$. Dividing this term by $\ell_i^{\text{new}}$ gives the output block for this particular $(i, j)$ pair.

Similarly, the other term $\ell_i \, e^{m_i - m_i^{\text{new}}} \circ O_i$ is the existing output $O_i$ that has been accumulated from the previous steps. Due to the associative property, we can also directly apply the scaling correction to $O_i$. The $\ell_i$ and $e^{m_i - m_i^{\text{new}}}$ are scaling factors according to equations (6), (8) to correct the scaling of previous steps.

Finally, we should understand why there is a $+$ in line 15. I find it easier to visualize if we set the block size $B = 1$. If we trace the matrix multiplications, we will observe that row $i$ of $O$ is only affected by row $i$ of $Q$, i.e. it corresponds to only the query token in position $i$. Now, $O_i$ represents the weighted average over all positions of the $V$ matrix, where the weights are determined by the softmax over the interaction between $Q_i$ (representing one token) and all positions of the $K$ matrix. This weighted average is why it is a $+$: we are accumulating the weighted sum over the $V_j$ blocks into $O_i$. The only complication is that we are applying the scaling corrections at each step.

Hopefully these explanations provide some intuition for the FlashAttention algorithm, which is quite a simple idea but makes a ton of difference in practice. It should be easy to implement this algorithm in numpy if the reader wishes to understand it better.
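
As a starting point, here is a minimal numpy sketch of the forward pass along these lines (single head, no masking or scaling, one block size $B$ that divides $N$; a toy illustration of the idea, not the paper's fused CUDA kernel).

```python
import numpy as np

def flash_attention_forward(Q, K, V, B=64):
    """Block-wise exact attention O = softmax(Q K^T) V without materializing the N x N matrices."""
    N, d = Q.shape
    O = np.zeros((N, d))                 # running (already normalized) output
    l = np.zeros(N)                      # running softmax denominators, one per row
    m = np.full(N, -np.inf)              # running row maxima

    for j in range(0, N, B):             # outer loop over K, V blocks
        Kj, Vj = K[j:j + B], V[j:j + B]
        for i in range(0, N, B):         # inner loop over Q (and O, l, m) blocks
            Sij = Q[i:i + B] @ Kj.T                       # (B, B) block of S
            m_tilde = Sij.max(axis=1)                     # block row maxima
            P_tilde = np.exp(Sij - m_tilde[:, None])      # block numerator
            l_tilde = P_tilde.sum(axis=1)                 # block row sums

            m_new = np.maximum(m[i:i + B], m_tilde)       # merge statistics: new maxima
            l_new = (np.exp(m[i:i + B] - m_new) * l[i:i + B]
                     + np.exp(m_tilde - m_new) * l_tilde)  # merge statistics: new denominators

            # line-15-style update: rescale the old accumulator, add the new block, renormalize
            O[i:i + B] = (np.exp(m[i:i + B] - m_new)[:, None] * l[i:i + B, None] * O[i:i + B]
                          + np.exp(m_tilde - m_new)[:, None] * (P_tilde @ Vj)) / l_new[:, None]
            m[i:i + B], l[i:i + B] = m_new, l_new
    return O

# compare against the naive reference
rng = np.random.default_rng(0)
N, d = 256, 16
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, d))
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
P = P / P.sum(axis=1, keepdims=True)
assert np.allclose(flash_attention_forward(Q, K, V), P @ V)
```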