# Core Vocabulary
> The essential terminology of machine learning, defined precisely and connected to how Edifice uses each concept.

## Why a Vocabulary Guide?

Every Edifice guide uses these terms. The existing architecture guides (attention mechanisms,
state space models, generative models, etc.) assume you already know what "embedding" and
"softmax" mean. This guide is your reference -- skim it now to build familiarity, then come
back whenever you hit a term you don't recognize.

Terms are grouped by how you encounter them: first the data, then the network, then training,
then evaluation.

## Data

### Features
The individual measurements or attributes that describe each data point. In an image, each pixel
value is a feature. In a game state, the player's x-position, y-position, damage percentage,
and action state are all features. Features are the raw inputs your network sees.

### Labels
The correct answers for supervised learning. If you're training a network to classify images of
cats and dogs, the label for each image is "cat" or "dog." If you're predicting a player's next
action, the label is the action they actually took. Not all ML tasks have labels -- generative
models and self-supervised methods learn from the data itself.

### Samples (Examples)
A single data point: one image, one game frame, one sentence. Your dataset is a collection of
samples, and each sample consists of features (input) and optionally a label (target output).

### Batch
A group of samples processed together in one forward pass. Instead of feeding the network one
sample at a time, you feed it a batch of 32, 64, or 256 samples simultaneously. This is faster
because GPUs are designed for parallel computation. In Edifice, the batch dimension is always the
first dimension of a tensor: `{batch_size, ...}`.

### Epoch
One complete pass through the entire training dataset. If your dataset has 10,000 samples and
your batch size is 100, one epoch = 100 batch updates. Training typically runs for many epochs.
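
The batch and epoch arithmetic can be checked with a few lines of plain Elixir. This is only a sketch: the "samples" are integer placeholders, and a real pipeline would batch Nx tensors rather than lists.

```elixir
# Hypothetical dataset: 10,000 samples, represented here by their indices.
samples = Enum.to_list(1..10_000)
batch_size = 100

# Split the dataset into consecutive batches of `batch_size` samples.
batches = Enum.chunk_every(samples, batch_size)

# One epoch visits every batch exactly once, so the number of batches
# is also the number of update steps per epoch.
updates_per_epoch = length(batches)
IO.puts("updates per epoch: #{updates_per_epoch}")
```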

### Dataset Split
The full dataset is typically divided into three parts (the percentages are a common choice, not a rule):

```
Training set (80%)     What the network learns from
Validation set (10%)   Checked during training to detect overfitting
Test set (10%)         Evaluated once at the end to measure true performance
```

The network never trains on validation or test data. The validation set is your early warning
system -- if training loss keeps dropping but validation loss starts rising, the network is
overfitting.
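
As a rough illustration, a three-way split can be written with `Enum.split/2`. The module name `SplitSketch` and the default fractions are assumptions made for this example, not part of Edifice.

```elixir
defmodule SplitSketch do
  # Splits a list into {train, validation, test} using the given fractions.
  # Whatever remains after train and validation becomes the test set.
  def split(samples, train_frac \\ 0.8, val_frac \\ 0.1) do
    n = length(samples)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)

    {train, rest} = Enum.split(samples, n_train)
    {val, test} = Enum.split(rest, n_val)
    {train, val, test}
  end
end

# Shuffle first so that each part is a random sample of the data.
{train, val, test} = SplitSketch.split(Enum.shuffle(1..1_000))
IO.inspect({length(train), length(val), length(test)})
```

Shuffling before splitting matters when the data is ordered (by time or by class, for example); for time series you would usually split chronologically instead.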

## Network Architecture

### Parameters (Weights and Biases)
The learnable numbers in a network. **Weights** scale inputs (how much does this input matter?).
**Biases** shift the result (what's the baseline output regardless of input?). When we say a
network has "7 billion parameters," we mean 7 billion individually adjustable numbers. During
training, every parameter gets nudged a little bit each update step.

### Layer
A distinct processing step in the network. Each layer takes a tensor in and produces a tensor
out. Layers are the building blocks of all architectures. Common types:

- **Dense (fully connected)**: every input connects to every output
- **Convolutional**: a shared filter slides across the input
- **Attention**: each position weighs contributions from all other positions
- **Normalization**: rescales activations to stabilize training
- **Recurrent**: maintains a hidden state that carries information across timesteps

### Activation Function
A non-linear function applied after a layer's linear transformation. Without activations, a
stack of layers would collapse into a single linear function. Common activations:

```
ReLU:     max(0, x)                  Most common. Simple, fast. Zero for negatives.
SiLU:     x * sigmoid(x)             Smoother than ReLU. Used in modern transformers.
GELU:     x * Φ(x)                   Φ = standard normal CDF. Popular in language models.
Sigmoid:  1 / (1 + e^-x)             Squashes to (0, 1). Used for probabilities.
Tanh:     (e^x - e^-x)/(e^x + e^-x)  Squashes to (-1, 1). Used in gates.
Softmax:  e^x_i / sum_j(e^x_j)       Squashes a vector to a probability distribution.
```

You'll see "SiLU" and "GELU" frequently in Edifice's modern architectures. Older architectures
tend to use ReLU.
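
The formulas above translate almost directly into code. The sketch below uses plain Elixir floats and Erlang's `:math` module so it runs without dependencies; `ActivationSketch` is a hypothetical name, and in practice these functions are applied element-wise to tensors by the framework.

```elixir
defmodule ActivationSketch do
  def relu(x), do: max(0.0, x)
  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))
  def silu(x), do: x * sigmoid(x)
  def tanh(x), do: :math.tanh(x)

  # GELU uses the standard normal CDF: Φ(x) = 0.5 * (1 + erf(x / sqrt(2))).
  def gelu(x), do: x * 0.5 * (1.0 + :math.erf(x / :math.sqrt(2.0)))

  # Softmax works on a whole list rather than a single number.
  # Subtracting the maximum first avoids overflow in exp without
  # changing the result.
  def softmax(xs) do
    m = Enum.max(xs)
    exps = Enum.map(xs, fn x -> :math.exp(x - m) end)
    total = Enum.sum(exps)
    Enum.map(exps, fn e -> e / total end)
  end
end

IO.inspect(ActivationSketch.softmax([1.0, 2.0, 3.0]))
```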

### Embedding
A learned mapping from discrete items (words, tokens, categories) to continuous vectors. Instead
of representing the word "cat" as an arbitrary integer ID like 4271, you represent it as a
256-dimensional vector of learned floating-point numbers. This lets the network discover that
similar things have similar vectors -- "cat" and "kitten" end up nearby in embedding space.

In Edifice, `embed_size` is one of the most common parameters. It determines the dimensionality
of these vector representations. Larger embeddings can capture more nuance but require more
computation.
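
Mechanically, an embedding is a lookup table indexed by token ID. The sketch below stores the table as a map of plain lists; `EmbeddingSketch` and its random initialization are illustrative assumptions. In a real network the table is a tensor whose values are adjusted by training.

```elixir
defmodule EmbeddingSketch do
  # Builds a table with one random vector per token ID (0..vocab_size - 1).
  # These numbers stand in for parameters that training would adjust.
  def init(vocab_size, embed_size) do
    for id <- 0..(vocab_size - 1), into: %{} do
      {id, for(_ <- 1..embed_size, do: :rand.uniform() - 0.5)}
    end
  end

  # An embedding lookup replaces each token ID with its vector.
  def lookup(table, token_ids), do: Enum.map(token_ids, &Map.fetch!(table, &1))
end

table = EmbeddingSketch.init(10, 4)
vectors = EmbeddingSketch.lookup(table, [3, 7, 3])
IO.inspect(vectors)
```

The same ID always maps to the same vector, so both occurrences of token 3 above produce identical rows.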

### Hidden State
Internal information a network carries forward, either across layers or across timesteps.
In recurrent networks, the hidden state is explicitly maintained -- it's the network's "memory"
of what it has seen so far. In transformers, the concept is more implicit: the evolving
representations at each layer serve as the hidden state.

When Edifice architectures have a `hidden_size` parameter, it controls the width of these
internal representations. Larger hidden sizes give the network more capacity to represent
complex patterns.

### Residual Connection (Skip Connection)
A shortcut that adds a layer's input directly to its output:

```
output = layer(input) + input
```

This seemingly small change (introduced with ResNet in 2015) made very deep networks practical
to train. It mitigates the vanishing gradient problem: gradients can flow directly through the
addition, bypassing the layer entirely if needed. Nearly every modern architecture uses residual connections. When Edifice
guides mention "residual stream" or "skip connections," this is what they mean.
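
A minimal sketch of the idea, using plain lists and a stand-in "layer" function (the name `ResidualSketch` is hypothetical):

```elixir
defmodule ResidualSketch do
  # Wraps any function on lists of numbers with a skip connection:
  # output = layer(input) + input, added element by element.
  def residual(layer_fun) do
    fn input ->
      input
      |> layer_fun.()
      |> Enum.zip(input)
      |> Enum.map(fn {out, inp} -> out + inp end)
    end
  end
end

# A stand-in "layer" that halves every element.
halve = fn xs -> Enum.map(xs, fn x -> x * 0.5 end) end
block = ResidualSketch.residual(halve)
IO.inspect(block.([2.0, 4.0, 6.0]))
```

Note that the layer's output must have the same shape as its input for the addition to work. If the wrapped layer outputs all zeros, the block simply passes its input through unchanged.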

### Normalization
Rescaling activations to have consistent statistics (roughly zero mean, unit variance). Without
normalization, activations can grow or shrink exponentially through many layers, making training
unstable. You'll encounter several types in Edifice:

- **Layer Normalization**: normalizes across features within each sample
- **RMSNorm**: a faster variant that skips mean centering (used in many recent transformer-style architectures)
- **Batch Normalization**: normalizes across the batch (common in CNNs, less so in transformers)
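
To make the difference between the first two concrete, here is a sketch of both applied to a single feature vector. It leaves out the learnable scale and shift parameters that real layers usually include, and the epsilon value is an arbitrary choice for the example.

```elixir
defmodule NormSketch do
  # Small constant that prevents division by zero.
  @eps 1.0e-6

  # Layer normalization: subtract the mean, divide by the standard deviation.
  def layer_norm(xs) do
    n = length(xs)
    mean = Enum.sum(xs) / n
    var = Enum.sum(Enum.map(xs, fn x -> (x - mean) * (x - mean) end)) / n
    Enum.map(xs, fn x -> (x - mean) / :math.sqrt(var + @eps) end)
  end

  # RMSNorm: divide by the root mean square, with no mean subtraction.
  def rms_norm(xs) do
    n = length(xs)
    mean_sq = Enum.sum(Enum.map(xs, fn x -> x * x end)) / n
    Enum.map(xs, fn x -> x / :math.sqrt(mean_sq + @eps) end)
  end
end

IO.inspect(NormSketch.layer_norm([1.0, 2.0, 3.0, 4.0]))
IO.inspect(NormSketch.rms_norm([1.0, 2.0, 3.0, 4.0]))
```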

### Attention
A mechanism where each element in a sequence computes how relevant every other element is to it,
then aggregates information based on those relevance scores. This is the core operation in
transformers and the subject of an entire [Edifice guide](attention_mechanisms.md). The key
intuition: attention lets any token directly access any other token, regardless of distance.
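
The sketch below shows one common form, scaled dot-product attention, on plain lists of vectors. It is an assumption-laden simplification: a single head, no masking, and no learned query/key/value projections. `AttentionSketch` is a hypothetical module, not Edifice's implementation.

```elixir
defmodule AttentionSketch do
  defp dot(a, b), do: Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()

  defp softmax(xs) do
    m = Enum.max(xs)
    exps = Enum.map(xs, fn x -> :math.exp(x - m) end)
    total = Enum.sum(exps)
    Enum.map(exps, fn e -> e / total end)
  end

  # Returns, for each query, the relevance weights it assigns to every key.
  # Scores are dot products scaled by sqrt(vector size), then softmaxed.
  def weights(queries, keys) do
    scale = :math.sqrt(length(hd(keys)))

    Enum.map(queries, fn q ->
      keys |> Enum.map(fn k -> dot(q, k) / scale end) |> softmax()
    end)
  end

  # Each output vector is the weighted sum of all value vectors.
  def attend(queries, keys, values) do
    for row <- weights(queries, keys) do
      row
      |> Enum.zip(values)
      |> Enum.map(fn {w, v} -> Enum.map(v, fn x -> w * x end) end)
      |> Enum.zip_with(fn column -> Enum.sum(column) end)
    end
  end
end

# Self-attention: the same three token vectors act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
IO.inspect(AttentionSketch.attend(tokens, tokens, tokens))
```

Every query is compared against every key, which is where the quadratic cost in sequence length comes from.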

### Encoder and Decoder
Two complementary roles in many architectures:

```
Encoder:  raw data → compressed representation (understanding)
Decoder:  compressed representation → output (generation)
```

An encoder takes raw input and distills it into a representation, often lower-dimensional, that
captures the essential information. A decoder takes a representation and expands it back
into the output space. Some architectures use only an encoder (classification), some only a
decoder (generation), and some use both (translation, VAEs).

## Training

### Loss Function
A function that measures the distance between the network's prediction and the correct answer.
The choice of loss function defines what "correct" means. Common losses:

- **Mean Squared Error (MSE)**: `mean((predicted - actual)²)` -- for regression
- **Cross-Entropy**: `-sum(target * log(predicted))` -- for classification
- **KL Divergence**: measures how one probability distribution differs from another -- used in VAEs
- **Contrastive losses**: push similar things together and different things apart in embedding space
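
The first two formulas can be sketched in plain Elixir as follows. `LossSketch` is a hypothetical module, and the cross-entropy version assumes `predicted` already holds probabilities (for example, the output of a softmax).

```elixir
defmodule LossSketch do
  # Mean squared error between two equally long lists of numbers.
  def mse(predicted, actual) do
    squared = Enum.zip_with(predicted, actual, fn p, a -> (p - a) * (p - a) end)
    Enum.sum(squared) / length(squared)
  end

  # Cross-entropy between a target distribution and predicted probabilities.
  # Terms with a zero target contribute nothing and are skipped, so log is
  # never evaluated for them.
  def cross_entropy(target, predicted) do
    Enum.zip(target, predicted)
    |> Enum.reject(fn {t, _p} -> t == 0 end)
    |> Enum.map(fn {t, p} -> -t * :math.log(p) end)
    |> Enum.sum()
  end
end

IO.inspect(LossSketch.mse([2.0, 4.0], [1.0, 2.0]))
IO.inspect(LossSketch.cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1]))
```

With a one-hot target, cross-entropy reduces to the negative log of the probability assigned to the correct class, so a more confident correct prediction yields a lower loss.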

### Optimizer
The algorithm that updates parameters based on gradients. Plain gradient descent is the simplest
version, but in practice most training runs use something more sophisticated:

- **SGD**: basic gradient descent with optional momentum
- **Adam**: adapts the learning rate per parameter based on gradient history (the default choice)
- **AdamW**: Adam with decoupled weight decay (current standard for transformers)

The optimizer is how the network actually learns. Each optimizer makes different tradeoffs
between speed, stability, and memory usage.
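
To show what "adapts the learning rate per parameter" means, here is a sketch of one Adam update for a single scalar parameter, following the standard published update rule. The module name, the state map layout, and the default values are choices made for this example; real optimizers apply the same arithmetic to every parameter tensor.

```elixir
defmodule AdamSketch do
  # One Adam update for a single scalar parameter `w` with gradient `g`.
  # `state` holds the running averages m (of gradients) and v (of squared
  # gradients) plus the step counter t.
  def step(w, g, %{m: m, v: v, t: t}, lr \\ 0.001, b1 \\ 0.9, b2 \\ 0.999, eps \\ 1.0e-8) do
    t = t + 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g

    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - :math.pow(b1, t))
    v_hat = v / (1 - :math.pow(b2, t))

    # Dividing by sqrt(v_hat) gives parameters with large gradients
    # a smaller effective step.
    new_w = w - lr * m_hat / (:math.sqrt(v_hat) + eps)
    {new_w, %{m: m, v: v, t: t}}
  end
end

{w1, state} = AdamSketch.step(1.0, 2.0, %{m: 0.0, v: 0.0, t: 0}, 0.1)
IO.inspect({w1, state.t})
```

The extra state (two running averages per parameter) is the memory cost Adam pays compared to plain SGD.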

### Learning Rate
The step size for parameter updates. Often the single most important hyperparameter:

```
Too high:    loss oscillates or diverges (overshooting the valley)
Too low:     training is painfully slow (tiny steps)
Just right:  loss decreases steadily toward a good solution
```

Modern practice often uses a **learning rate schedule** that starts high and decreases over
training, or warms up from a low value and then decays.

### Gradient
The derivative of the loss with respect to a parameter. It tells you: "if I increase this
parameter slightly, how much does the loss change, and in which direction?" Gradients point
toward increasing loss, so you step in the opposite direction.
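
A toy example makes both the gradient and the learning rate tangible. The sketch below minimizes the made-up loss `w * w`, whose derivative `2 * w` is written by hand here (in practice automatic differentiation computes it), and compares three learning rates.

```elixir
defmodule DescentSketch do
  # Toy loss with its minimum at w = 0, and its derivative.
  def loss(w), do: w * w
  def grad(w), do: 2.0 * w

  # Runs `steps` updates of w <- w - learning_rate * gradient.
  def run(w0, learning_rate, steps) do
    Enum.reduce(1..steps, w0, fn _step, w -> w - learning_rate * grad(w) end)
  end
end

# Too high, too low, and a reasonable value for this particular loss.
for lr <- [1.1, 0.001, 0.1] do
  final = DescentSketch.run(1.0, lr, 50)
  IO.puts("lr=#{lr}  final loss=#{DescentSketch.loss(final)}")
end
```

With the largest rate each step overshoots the minimum and the loss grows; with the smallest, 50 steps barely move the parameter; the middle value brings the loss close to zero.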

### Backpropagation
The algorithm for computing gradients efficiently by working backward from the loss through
the network. See the [ML Foundations](ml_foundations.md) guide for the intuition. In Edifice,
you never implement this -- Nx's automatic differentiation handles it.

### Hyperparameters
Settings that you choose *before* training, as opposed to parameters that are *learned during*
training. Examples: learning rate, batch size, number of layers, hidden size, dropout rate.
The options you pass to `Edifice.build(:mamba, embed_size: 256, num_layers: 4)` are
hyperparameters.

### Regularization
Techniques that prevent overfitting by constraining the network:

- **Dropout**: randomly set a fraction of activations to zero during training (disabled at inference)
- **Weight decay**: add a penalty that grows with the size of the weights, pulling them toward zero
- **Early stopping**: stop training once validation performance stops improving
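
As an illustration of the first technique, here is a sketch of the "inverted" dropout variant on a plain list. The rescaling of surviving values is an assumption about the variant shown (it is a common convention, so that nothing needs to change at inference time); `DropoutSketch` is a hypothetical name.

```elixir
defmodule DropoutSketch do
  # Zeroes each activation with probability `rate` and scales the survivors
  # by 1 / (1 - rate) so the expected value of each element is unchanged.
  def dropout(xs, rate) when rate >= 0.0 and rate < 1.0 do
    keep = 1.0 - rate

    Enum.map(xs, fn x ->
      if :rand.uniform() < keep, do: x / keep, else: 0.0
    end)
  end
end

# Results vary from run to run because the mask is random.
IO.inspect(DropoutSketch.dropout([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 0.5))
```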

### Fine-Tuning
Taking a network that was trained on one task and continuing training on a different (usually
smaller) dataset. The pretrained weights provide a strong starting point. Edifice's LoRA and
Adapter modules (in the Meta family) are specifically designed for parameter-efficient fine-tuning,
where you freeze most of the pretrained weights and only train a small number of new parameters.

## Architecture-Specific Terms

These appear frequently across the Edifice guides:

### Sequence Length
The number of timesteps or tokens in a sequential input. A sentence of 20 words has sequence
length 20. A recording of 60 game frames has sequence length 60. Many architecture choices
(attention vs. SSMs vs. recurrence) are driven by how sequence length affects computation cost.

### Context Window
The maximum sequence length a model can process. Attention-based models have quadratic cost in
sequence length (doubling the context costs 4x), which is why efficient alternatives like SSMs
and linear attention exist. In Edifice, `window_size` or `max_seq_len` control this.
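
The quadratic growth is easy to verify by counting the query-key scores that full attention computes. This is a back-of-the-envelope sketch that ignores constant factors and the feature dimension.

```elixir
# Number of query-key score entries full attention computes for a sequence:
# every token is compared with every token.
score_entries = fn seq_len -> seq_len * seq_len end

for n <- [1_024, 2_048, 4_096] do
  IO.puts("seq_len=#{n}  scores=#{score_entries.(n)}")
end
```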

### Latent Space
A compressed, learned representation space. When an encoder maps a 784-dimensional image to a
32-dimensional vector, that 32-dimensional space is the latent space. Points that are close in
latent space should correspond to similar data. Generative models like VAEs sample from this
space to create new data.

### Token
The fundamental unit a sequence model operates on. In text, a token might be a word or a subword
piece. In vision, a token is typically an image patch. In game AI, a token can be a frame's worth
of state. The idea is always the same: break the input into discrete chunks that the network
processes as a sequence (recurrent models step through them one at a time; attention-based models
process them in parallel).

### Pooling
Aggregating a variable-length representation into a fixed-size one. After processing a
sequence of 60 tokens, you might need a single vector for classification. Common strategies:

- **Mean pooling**: average all token representations
- **Max pooling**: take the maximum across tokens per feature
- **Last token**: use only the final token's representation (common in causal models)
- **CLS token**: use a special learned token prepended to the sequence
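
The first three strategies can be sketched on a list of token vectors as follows (the CLS approach needs a learned token, so it is omitted). `PoolSketch` is a hypothetical module working on plain lists rather than tensors.

```elixir
defmodule PoolSketch do
  # `tokens` is a list of equally sized feature vectors, one per token.

  # Average each feature across all tokens.
  def mean_pool(tokens) do
    n = length(tokens)
    Enum.zip_with(tokens, fn column -> Enum.sum(column) / n end)
  end

  # Take the largest value of each feature across all tokens.
  def max_pool(tokens), do: Enum.zip_with(tokens, fn column -> Enum.max(column) end)

  # Keep only the final token's vector.
  def last_token(tokens), do: List.last(tokens)
end

tokens = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
IO.inspect(PoolSketch.mean_pool(tokens))
IO.inspect(PoolSketch.max_pool(tokens))
IO.inspect(PoolSketch.last_token(tokens))
```

Whatever the sequence length, each function returns a single vector with one entry per feature.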

### Projection
A linear transformation (matrix multiply + optional bias) that changes a tensor's feature
dimension. Used constantly in neural networks to map between different sizes:

```
Input:  {batch, seq, 256}
Dense:  W is {256, 512}
Output: {batch, seq, 512}
```

When Edifice guides say "project to hidden size," they mean a dense layer that maps from one
feature dimension to another.
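
In code, a projection of one feature vector is a matrix-vector product plus a bias. The sketch below uses nested lists with tiny sizes (3 features in, 2 out) instead of the 256 and 512 above; `ProjectionSketch` and the weight values are made up for the example.

```elixir
defmodule ProjectionSketch do
  defp dot(a, b), do: Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()

  # Projects one feature vector `x` of length `in`.
  # `w` is a list of `in` rows, each of length `out` (shape {in, out}),
  # and `bias` has length `out`.
  def project(x, w, bias) do
    w
    |> Enum.zip_with(fn column -> dot(x, column) end)
    |> Enum.zip_with(bias, fn y, b -> y + b end)
  end

  # The same weights are applied independently at every sequence position.
  def project_sequence(seq, w, bias), do: Enum.map(seq, fn x -> project(x, w, bias) end)
end

w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.5, -0.5]
IO.inspect(ProjectionSketch.project([1.0, 2.0, 3.0], w, bias))
```

Only the last (feature) dimension changes size; the batch and sequence dimensions pass through untouched.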

## What's Next

With this vocabulary in hand, you can:

1. **[Problem Landscape](problem_landscape.md)** -- understand what types of problems exist and which architectures solve them
2. **[Reading Edifice](reading_edifice.md)** -- learn the code patterns that every architecture in this library follows
3. Start reading any architecture guide and look up terms here as needed