# Attention Mechanisms
> From quadratic full attention through linear approximations to recurrence-based alternatives that scale to arbitrary sequence lengths.

## Overview

Attention is the mechanism that made modern sequence modeling work at scale. The core idea is
simple: for each position in a sequence, compute a weighted combination of all other positions,
where the weights reflect relevance. This gives every token direct access to every other token,
enabling the kind of long-range dependency modeling that recurrent networks struggle with. The
cost is quadratic: computing all pairwise relevance scores for a length-L sequence requires
O(L^2) time and memory.

The 12 attention modules in Edifice trace three distinct paths away from that quadratic
bottleneck. The first path keeps the full attention formulation but reduces redundancy:
Grouped Query Attention (GQA) shares key-value heads across groups of query heads, cutting
KV cache size without meaningful quality loss. The second path replaces the softmax attention
matrix with a cheaper approximation: LinearTransformer uses kernel feature maps, Performer
uses random orthogonal features (FAVOR+), and Nystromformer uses landmark-based matrix
approximation. The third path abandons the attention matrix entirely in favor of recurrence:
RetNet uses exponential decay retention, RWKV uses weighted key-value aggregation, GLA adds
data-dependent gating to linear attention, HGRN uses hierarchical gating with state expansion,
and Griffin combines a gated linear recurrence with local attention. Two modules stand apart:
FNet replaces attention with an unparameterized Fourier Transform (no learned weights at all),
and Perceiver uses cross-attention to a learned latent array, decoupling compute from input size.

All modules follow a common block pattern: pre-normalization, the attention/mixing mechanism,
a residual connection, then a feed-forward network with another residual. They accept sequence
inputs of shape [batch, seq_len, embed_size] and output [batch, hidden_size] from the last
timestep.

## Conceptual Foundation

The standard attention computation proceeds in three steps. First, project the input into
queries Q, keys K, and values V. Second, compute attention weights as the softmax of scaled
dot products. Third, apply those weights to the values:

```
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Where:
  Q = X W_Q    [L, d_k]     (what am I looking for?)
  K = X W_K    [L, d_k]     (what do I contain?)
  V = X W_V    [L, d_v]     (what do I provide?)

  Q K^T        [L, L]       (pairwise relevance scores)
  softmax(...)  [L, L]       (normalized attention weights)
  ... * V      [L, d_v]     (weighted combination of values)
```

Multi-head attention runs H independent attention computations in parallel, each with
dimension d_k = d_model / H, then concatenates and projects the results. This lets
different heads attend to different types of relationships.

The three efficiency paths can be understood as different ways to avoid materializing the
L x L attention matrix:

```
                    Efficiency Strategies
                    =====================

Path 1: Reduce Redundancy (keep full attention)
  +------------+     +------------+     +------------+
  | Multi-Head |     |    GQA     |     | Perceiver  |
  | (baseline) | --> | (shared KV)| ... | (latent    |
  |  H KV heads|     | G KV heads |     |  bottleneck|
  +------------+     +------------+     +------------+
  O(L^2)              O(L^2), less KV    O(L*M + M^2)

Path 2: Approximate the Attention Matrix
  +------------+     +------------+     +------------+
  |  Linear    |     | Performer  |     |Nystromformer|
  |Transformer | ... | (FAVOR+)   | ... | (landmarks)|
  | phi(Q)(KV) |     | random feat|     | Nystrom    |
  +------------+     +------------+     +------------+
  O(L*d^2)            O(L*d*m)           O(L*M)

Path 3: Replace with Recurrence
  +--------+     +--------+     +--------+     +--------+     +--------+
  | RetNet |     | RWKV   |     |  GLA   |     | HGRN   |     |Griffin |
  | (decay)|     | (WKV)  |     | (gated)|     | (hier.)|     |(RG-LRU)|
  +--------+     +--------+     +--------+     +--------+     +--------+
  O(L)            O(L)            O(L)           O(L)           O(L)

Special:
  +--------+
  |  FNet  |     No learned weights -- pure FFT mixing
  | (FFT)  |     O(L log L)
  +--------+
```

## Architecture Evolution

```
2017                2020              2021              2023              2024
 |                   |                 |                 |                 |
 Multi-Head Attn     Linear            FNet             RetNet            Griffin
 (Vaswani)          Transformer       (no-params FFT)   (decay retention) (RG-LRU+local)
 |                   |                 |                 |                 |
 |                  Performer          Perceiver        RWKV-7            HGRN-2
 |                  (FAVOR+)          (latent array)    (linear RNN)      (state expansion)
 |                   |                                   |
 |                  Nystromformer                        GLA
 |                  (landmarks)                         (gated linear)
 |
 GQA (2023, Llama 2)

  Full Quadratic          Subquadratic Approx.          Linear Recurrent
  ================        =====================         ================
  MultiHead, GQA          LinearTransformer,            RetNet, RWKV, GLA,
  Perceiver               Performer, Nystromformer,     HGRN, Griffin
                          FNet
```

## When to Use What

```
+-------------------+------------------+------------------+--------------------+
| Requirement       | Best Module      | Runner-up        | Why                |
+-------------------+------------------+------------------+--------------------+
| Maximum quality   | MultiHead        | GQA              | Full O(L^2) attn   |
| (short sequences) |                  |                  | captures everything|
+-------------------+------------------+------------------+--------------------+
| Inference speed   | GQA              | RetNet           | Fewer KV heads =   |
| (KV cache bound)  |                  |                  | smaller cache      |
+-------------------+------------------+------------------+--------------------+
| Streaming /       | RetNet, RWKV     | GLA              | O(1) per-step      |
| real-time         |                  |                  | recurrent mode     |
+-------------------+------------------+------------------+--------------------+
| Very long seqs    | Performer        | Nystromformer    | O(L) attention via |
| (>8K tokens)      |                  |                  | random features    |
+-------------------+------------------+------------------+--------------------+
| Multimodal /      | Perceiver        | --               | Handles arbitrary  |
| variable input    |                  |                  | input modalities   |
+-------------------+------------------+------------------+--------------------+
| Fastest training  | FNet             | LinearTransformer| Zero attention     |
| (quality tradeoff)|                  |                  | params (FFT only)  |
+-------------------+------------------+------------------+--------------------+
| Short seqs with   | GLA              | Griffin          | Data-dependent     |
| hardware efficiency|                 |                  | gating, native ops |
+-------------------+------------------+------------------+--------------------+
| Recurrence +      | Griffin          | HGRN             | RG-LRU layers +   |
| local attention   |                  |                  | periodic local attn|
+-------------------+------------------+------------------+--------------------+
| On-device /       | RWKV             | GLA              | O(1) inference,    |
| memory-limited    |                  |                  | fixed-size state   |
+-------------------+------------------+------------------+--------------------+
| Stable long-range | HGRN             | RetNet           | State expansion    |
| with O(1) memory  |                  |                  | enriches recurrence|
+-------------------+------------------+------------------+--------------------+
```

## Key Concepts

### Grouped Query Attention and KV Cache Efficiency

In standard multi-head attention with H heads, each head maintains its own key and value
projections. During autoregressive inference, the KV cache stores all past K and V tensors
for each head, consuming O(H * L * d) memory. GQA reduces this by grouping G query heads
to share a single KV head. With H=8 query heads and G=2 KV heads, each KV head serves 4
query heads, cutting KV cache by 4x:

```
MHA:  Q1-K1-V1  Q2-K2-V2  Q3-K3-V3  Q4-K4-V4   (4 KV heads, full cache)
GQA:  Q1-K1-V1  Q2-K1-V1  Q3-K2-V2  Q4-K2-V2   (2 KV heads, half cache)
MQA:  Q1-K1-V1  Q2-K1-V1  Q3-K1-V1  Q4-K1-V1   (1 KV head, minimal cache)
```

GQA is used in Llama 2 (70B), Mistral, and Gemma. The quality loss from KV sharing is
minimal because the key-value representations across heads are highly correlated in practice.

### Kernel Approximations: From Quadratic to Linear

The softmax attention matrix softmax(QK^T / sqrt(d)) can be decomposed using kernel methods.
If we find a feature map phi such that exp(q^T k) is approximately equal to phi(q)^T phi(k),
we can rewrite attention as:

```
Standard:   output = softmax(Q K^T) V            -- must form L x L matrix
Linear:     output = phi(Q) (phi(K)^T V)          -- form d x d matrix first
```

By computing phi(K)^T V first (a d x d matrix regardless of sequence length), we avoid the
L x L bottleneck entirely. Different modules use different feature maps:

- **LinearTransformer**: phi(x) = ELU(x) + 1 (simple, deterministic)
- **Performer**: phi(x) uses orthogonal random features to approximate the exponential
  kernel (FAVOR+ mechanism), with better approximation quality
- **Nystromformer**: takes a different approach entirely, sampling M landmark points via
  average pooling and reconstructing the attention matrix as F1 * pinv(F2) * F3

### The Retention / Recurrence Path

RetNet, RWKV, GLA, HGRN, and Griffin share a design philosophy: replace the attention matrix
with a recurrent state that can be updated incrementally. This gives O(1) per-token inference
(constant regardless of sequence length) while still allowing parallel training via scan-like
formulations.

**RetNet** uses multi-scale exponential decay: each head has a decay factor gamma_h, and the
retention mechanism computes s_n = gamma * s_{n-1} + K_n^T * V_n. Different heads use
different decay rates, capturing patterns at multiple timescales. RetNet supports three
computation modes with the same weights: parallel (training), recurrent (inference), and
chunkwise (long sequences).

**RWKV** (version 7, "Goose") uses a generalized delta rule with separate time-mixing and
channel-mixing sub-blocks. The time-mixing block computes weighted key-value aggregation
with learned per-channel decay. The channel-mixing block acts as a gated FFN. RWKV achieves
O(1) per-step inference and has been deployed on 1.5 billion Windows devices.

**GLA** adds data-dependent gating to linear attention. Rather than using fixed feature maps,
it computes gates from the input that modulate information flow. This makes GLA particularly
effective on short sequences (under 2K tokens) where it can outperform FlashAttention-2 in
wall-clock time.

**HGRN-2** uses hierarchical gating with state expansion: the hidden state is expanded to a
higher dimension during recurrence (enriching the representation), then contracted back for
output. The forget and input gates create a hierarchical structure where different layers
operate at different timescales.

**Griffin** combines a Real-Gated Linear Recurrent Unit (RG-LRU) with local sliding-window
attention. The RG-LRU is simpler than Mamba's SSM -- just gated decay with a
norm-preserving sqrt(1 - a^2) input scaling. The local attention layers appear periodically
(e.g., every 3rd layer) for short-range precise recall. The Hawk variant uses only RG-LRU
without any attention.

### FNet and Perceiver: Structural Alternatives

**FNet** demonstrates that the token mixing provided by attention can be partially replaced
by a parameter-free operation. Applying the Fast Fourier Transform along the sequence axis
gives global mixing (every token influences every other through frequency-domain
multiplication) at O(L log L) cost with zero learned parameters. FNet achieves 92-97% of
BERT's quality on standard benchmarks while training roughly 7x faster.

**Perceiver** solves a different problem: handling inputs of arbitrary size and modality.
Instead of self-attending over the full input (O(N^2) for N input elements), Perceiver
cross-attends the input to a small learned latent array of M vectors (M << N). Subsequent
processing happens on the latent array at O(M^2) cost. The total cost O(N*M + M^2) scales
linearly with input size for fixed M, making Perceiver applicable to point clouds, audio,
video, and other modalities that would be infeasible with standard attention.

## Complexity Comparison

```
+-------------------+----------+----------+-----------+---------+-------------+
| Module            | Training | Inference| KV Cache  | Learned | Recurrent   |
|                   | (per tok)| (per tok)| Size      | Params  | Mode?       |
+-------------------+----------+----------+-----------+---------+-------------+
| MultiHead         | O(L^2)   | O(L)     | O(H*L*d)  | QKV+out | No          |
| GQA               | O(L^2)   | O(L)     | O(G*L*d)  | QKV+out | No          |
| Perceiver         | O(L*M)   | O(M^2)   | O(M*d)    | QKV+lat | No          |
| FNet              | O(L logL)| O(L logL)| None      | FFN only| No          |
| LinearTransformer | O(L*d^2) | O(d^2)   | O(d^2)    | QKV+out | Yes         |
| Nystromformer     | O(L*M)   | O(L*M)   | O(M*d)    | QKV+out | No          |
| Performer         | O(L*d*m) | O(d*m)   | O(d*m)    | QKV+out | Yes         |
| RetNet            | O(L^2)*  | O(1)     | O(d^2)    | QKV+gate| Yes (triple)|
| RWKV              | O(L)     | O(1)     | O(d)      | WKV+ch  | Yes         |
| GLA               | O(L)     | O(1)     | O(d^2)    | QKV+gate| Yes         |
| HGRN              | O(L)     | O(1)     | O(E*d)    | Gates   | Yes         |
| Griffin            | O(L)+**  | O(1)+**  | O(d)+**   | RG-LRU  | Yes (hybrid)|
+-------------------+----------+----------+-----------+---------+-------------+

Notes:
- L = sequence length, d = head dimension, H = num heads, G = num KV groups
- M = num landmarks (Nystrom) or num latents (Perceiver)
- m = num random features (Performer)
- E = state expansion factor (HGRN)
- * RetNet parallel mode is O(L^2) but supports O(1) recurrent inference
- ** Griffin adds O(W^2) for local attention layers (W = local window size)
- "Recurrent Mode" = supports O(1) per-step incremental inference
```

## Module Reference

- `Edifice.Attention.MultiHead` -- Standard multi-head attention with optional sliding window and QK LayerNorm
- `Edifice.Attention.GQA` -- Grouped Query Attention with configurable num_kv_heads for KV cache reduction
- `Edifice.Attention.Perceiver` -- Cross-attention to learned latent array for input-agnostic processing
- `Edifice.Attention.FNet` -- Fourier Transform token mixing with zero attention parameters
- `Edifice.Attention.LinearTransformer` -- Kernel-based linear attention using ELU+1 feature maps
- `Edifice.Attention.Nystromformer` -- Nystrom method approximation with configurable landmark points
- `Edifice.Attention.Performer` -- FAVOR+ random orthogonal feature attention for O(N) complexity
- `Edifice.Attention.RetNet` -- Multi-scale exponential decay retention with parallel/recurrent/chunkwise modes
- `Edifice.Attention.RWKV` -- RWKV-7 "Goose" with time-mixing (WKV attention) and channel-mixing (gated FFN)
- `Edifice.Attention.GLA` -- Gated Linear Attention with data-dependent decay and hardware-efficient training
- `Edifice.Attention.HGRN` -- HGRN-2 with hierarchical gating, state expansion, and contraction
- `Edifice.Attention.Griffin` -- Griffin (RG-LRU + local attention) and Hawk (RG-LRU only) variants

## Cross-References

- **[State Space Models](state_space_models.md)** -- SSM hybrid architectures (Jamba, Zamba,
  HybridBuilder) integrate MultiHead attention as periodic layers within SSM stacks. Griffin
  bridges this family with its RG-LRU, which is closely related to a diagonal SSM.
- **Building Blocks** -- RoPE (Rotary Position Embedding) and ALiBi (Attention with Linear Biases)
  provide position information to attention mechanisms. RMSNorm and LayerNorm are used for
  pre-normalization in all attention blocks.
- **Recurrent Networks** -- HGRN and Griffin straddle the boundary between attention and recurrence.
  Their gated recurrences are functionally similar to GRU/LSTM cells but with linear (not
  saturating) dynamics that enable parallel scan training.
- **[Generative Models](generative_models.md)** -- DiT (Diffusion Transformer) uses MultiHead
  attention as its backbone, with AdaptiveNorm for conditioning.

## Further Reading

1. Vaswani et al. "Attention Is All You Need." NeurIPS 2017. -- The transformer paper that
   established multi-head attention as the dominant sequence modeling primitive.
2. Ainslie et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head
   Checkpoints." arXiv:2305.13245 -- Interpolating between MHA and MQA for inference efficiency.
3. Katharopoulos et al. "Transformers are RNNs: Fast Autoregressive Transformers with Linear
   Attention." ICML 2020. -- The kernel trick that reduces attention from O(L^2) to O(L).
4. Sun et al. "Retentive Network: A Successor to Transformer for Large Language Models."
   arXiv:2307.08621 -- Multi-scale retention with parallel/recurrent/chunkwise triple paradigm.
5. De et al. "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient
   Language Models." arXiv:2402.19427 -- RG-LRU combined with local attention for a practical
   hybrid architecture.