# ML Foundations

> What neural networks are, how they learn, and why any of this works at all.

## Why This Guide Exists

Edifice gives you 90+ neural network architectures. Before you explore them, you need a mental model of what a neural network actually *is* and what it means for one to "learn." This guide builds that foundation from scratch. No prior ML knowledge is assumed -- just basic comfort with the idea that numbers go in and numbers come out.

If you already know what a loss function is and can explain backpropagation at a high level, skip to [Reading Edifice](reading_edifice.md) or the [Learning Path](learning_path.md).

## What Is a Neural Network?

A neural network is a function made of simple, stacked building blocks. Each block takes numbers in, transforms them, and passes them forward. Stack enough of these blocks together with the right transformations, and the network can approximate remarkably complex patterns.

The fundamental unit is the **neuron** (also called a node or unit). A neuron does three things:

```
1. Multiply each input by a weight       (how important is this input?)
2. Add all the weighted inputs together  (combine the evidence)
3. Apply an activation function          (introduce non-linearity)

Concretely:

  output = activation( w1*x1 + w2*x2 + ... + wN*xN + bias )
```

The weights and bias are the neuron's **parameters** -- the knobs the network adjusts during learning. The activation function is what makes neural networks more powerful than simple linear regression: without it, stacking layers would just produce another linear function, no matter how many layers you add.

### Layers

Neurons are organized into **layers**.
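
Before grouping neurons together, it can help to see the single-neuron formula above as code. This is a minimal sketch in plain Python, not Edifice or Nx code; the names `neuron` and `sigmoid` and the input values are illustrative assumptions, with sigmoid picked as one common activation function.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Illustrative values chosen so the weighted sum is exactly zero:
# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)  # 0.5, because sigmoid(0) = 0.5
```

Swapping `sigmoid` for another function changes only step 3; the weighted sum in steps 1 and 2 stays the same.
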
A layer is just a group of neurons that all process the same inputs and produce their outputs together:

```
Input        Layer 1      Layer 2      Output

[x1] ──────→ [n1] ──────→ [n5] ──────→ [prediction]
[x2] ──╲ ╱→  [n2] ──╲ ╱→  [n6] ──────→
        ╳            ╳
[x3] ──╱ ╲→  [n3] ──╱ ╲→  [n7]
         ╲→  [n4] ──╱
```

The key insight: **each neuron in a layer connects to every neuron in the next layer** (in a standard "dense" or "fully connected" layer). This means a layer with 4 neurons connecting to a layer with 3 neurons has 4 × 3 = 12 weight parameters, plus 3 biases.

Three terms you'll see everywhere:

- **Input layer**: the raw data entering the network (not really a "layer" of neurons -- just the data)
- **Hidden layers**: the intermediate layers where the network builds up internal representations
- **Output layer**: the final layer that produces the prediction

A network with many hidden layers is called a **deep** neural network -- that's where "deep learning" comes from. The depth is what gives these networks their power: early layers learn simple features, and later layers combine those into increasingly abstract representations.

### The Forward Pass

When data flows from input to output through the network, that's called the **forward pass**. Nothing mysterious -- it's just function composition. The output of layer 1 becomes the input to layer 2, and so on:

```
input → layer_1(input) → layer_2(...) → layer_3(...) → prediction
```

Every architecture in Edifice -- whether it's a simple MLP, a transformer, a Mamba SSM, or a graph network -- ultimately performs a forward pass. What differs is the *structure* of those intermediate transformations. Some architectures look at sequences one token at a time (recurrent networks). Some let every token attend to every other token (transformers). Some model the data as continuous dynamical systems (state space models). But the forward pass concept is universal.

## What Does "Learning" Mean?

A neural network starts with random parameters.
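
To make the forward pass and random initialization concrete, here is a minimal pure-Python sketch of a network with the same layer sizes as the diagram above (3 inputs, then 4 neurons, then 3 neurons, then 1 output). The helper names (`init_dense`, `dense`, `count_params`, `forward`), the ReLU activation, and the input values are assumptions made for illustration and are not part of Edifice or Nx; real code would use tensor operations rather than nested lists.

```python
import random


def relu(z):
    """A common activation: pass positives through, clamp negatives to zero."""
    return max(0.0, z)


def identity(z):
    return z


def init_dense(n_in, n_out, rng):
    """Random weights (one row per output neuron) and zero biases."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, activation):
    """Every neuron in the layer sees every input (a fully connected layer)."""
    weights, biases = layer
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def count_params(layer):
    weights, biases = layer
    return sum(len(row) for row in weights) + len(biases)


rng = random.Random(0)
layer_1 = init_dense(3, 4, rng)       # 3 inputs  -> 4 neurons
layer_2 = init_dense(4, 3, rng)       # 4 neurons -> 3 neurons
output_layer = init_dense(3, 1, rng)  # 3 neurons -> 1 prediction


def forward(x):
    """Function composition: each layer's output feeds the next layer."""
    h1 = dense(x, layer_1, relu)
    h2 = dense(h1, layer_2, relu)
    return dense(h2, output_layer, identity)


prediction = forward([0.2, -0.5, 1.0])
print(count_params(layer_2))  # 15: 4 * 3 weights plus 3 biases
print(len(prediction))        # 1
```

Because the weights come straight from a random number generator, the value this untrained network returns is unrelated to any target.
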
Its predictions are garbage. Learning is the process of adjusting those parameters so the predictions get better. This requires three ingredients:

### 1. A Loss Function

The **loss function** (also called cost function or objective) measures how wrong the network's predictions are. It takes the network's output and the correct answer, and produces a single number: the loss. Lower is better.

```
                  ┌────────────────┐
network output ──→│ Loss Function  │──→ single number (the loss)
correct answer ──→│                │
                  └────────────────┘

Examples:
  - Predicting a number?   Loss = (predicted - actual)²
  - Classifying images?    Loss = -log(probability of correct class)
```

The choice of loss function tells the network what "better" means. Different problems use different loss functions, and this choice shapes how the network learns.

### 2. Gradient Descent

Once we have a loss, we need a way to reduce it. **Gradient descent** is the core algorithm. The idea is intuitive: the gradient tells you which direction increases the loss fastest, so you step in the opposite direction.

```
Think of it like descending a mountain in fog:

  - You can't see the valley, but you can feel the slope under your feet
  - At each step, you move in the steepest downhill direction
  - Eventually you reach a low point

The "slope" is the gradient -- the derivative of the loss
with respect to each parameter.

The "step size" is the learning rate -- how far you move each update.

  new_weight = old_weight - learning_rate × gradient
```

A small learning rate means slow, careful progress. A large learning rate means faster movement but with the risk of overshooting the valley entirely. Choosing the right learning rate is one of the most impactful decisions in training.

### 3. Backpropagation

**Backpropagation** is how the network figures out each parameter's gradient.
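
For a model with a single weight, that gradient can be derived by hand. The sketch below is an illustrative assumption rather than anything from Edifice or Nx: it fits `w` so that `w * x` matches `y` under squared-error loss, applies the update rule `new_weight = old_weight - learning_rate × gradient`, and checks the hand-derived gradient against a finite-difference estimate (the helper `numeric_grad` is hypothetical).

```python
def loss(w, x, y):
    """Squared error of the one-weight model w * x against the target y."""
    return (w * x - y) ** 2


def grad(w, x, y):
    """Chain rule: d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return 2.0 * (w * x - y) * x


def numeric_grad(w, x, y, h=1e-5):
    """Finite-difference estimate: nudge w slightly and watch the loss."""
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2.0 * h)


x, y = 2.0, 6.0        # illustrative data point; the ideal weight is 3.0
w = 0.0                # starting guess
learning_rate = 0.1

for step in range(50):
    # Step against the gradient to reduce the loss.
    w = w - learning_rate * grad(w, x, y)

print(round(w, 6))                                            # 3.0
print(abs(grad(1.0, x, y) - numeric_grad(1.0, x, y)) < 1e-6)  # True
```

Backpropagation performs this same derivative calculation for every parameter of a multi-layer network.
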
It's just the chain rule from calculus, applied systematically backward through the network:

```
Forward:   input → layer_1 → layer_2 → layer_3 → prediction → loss

Backward:  input ← layer_1 ← layer_2 ← layer_3 ← prediction ← loss
                      ↑
           "how does each weight affect this loss?"
```

For each weight in the network, backpropagation computes: "if I increase this weight by a tiny amount, how much does the loss change?" Weights that contribute a lot to the error get large gradients (big updates). Weights that barely affect the loss get small gradients (small updates). This is what makes learning efficient -- the network focuses its adjustments where they matter most.

You don't need to implement backpropagation yourself. Nx (the numerical computing library under Edifice) handles this automatically through **automatic differentiation**. You define the forward pass, and Nx computes all the gradients for you.

## The Training Loop

Training a neural network is a repetitive cycle:

```
repeat until good enough:
  1. Forward pass:    feed a batch of data through the network
  2. Compute loss:    measure how wrong the predictions are
  3. Backward pass:   compute gradients via backpropagation
  4. Update weights:  adjust parameters in the direction that reduces loss
```

One pass through the entire training dataset is called an **epoch**. In practice, you don't feed the whole dataset at once -- you split it into **batches** (typically 32-512 samples) and update weights after each batch. This is called **mini-batch gradient descent**, and it's what virtually everyone uses because:

- Full-dataset gradient computation is too expensive for large datasets
- The noise from random batches can help the network escape shallow local minima
- It enables training on data that doesn't fit in memory

A typical training run might be 10-100 epochs, with hundreds or thousands of batch updates per epoch.

## Tensors and Shapes

Neural networks operate on **tensors** -- multi-dimensional arrays of numbers.
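
Nested Python lists are enough to make this concrete. In the sketch below, `shape` is a hypothetical helper (not an Nx function) that reports the size of each nesting level, assuming every level is regularly sized. Python prints tuples with parentheses where Nx uses braces.

```python
def shape(tensor):
    """Return the shape of a regularly nested list as a tuple of sizes."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break  # an empty level has no first element to descend into
        tensor = tensor[0]
    return tuple(dims)


print(shape(42))                         # ()
print(shape([1, 2, 3]))                  # (3,)
print(shape([[1, 2], [3, 4], [5, 6]]))   # (3, 2)

# One sample, 60 timesteps, 256 features per timestep.
batch = [[[0.0] * 256 for _ in range(60)]]
print(shape(batch))                      # (1, 60, 256)
```
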
If you know what a matrix is, a tensor is just the generalization to any number of dimensions:

```
Scalar:     42                          shape: {}          0 dimensions
Vector:     [1, 2, 3]                   shape: {3}         1 dimension
Matrix:     [[1, 2], [3, 4], [5, 6]]    shape: {3, 2}      2 dimensions
3D Tensor:  a stack of matrices         shape: {4, 3, 2}   3 dimensions
```

In Edifice and Nx, shapes are written as tuples. The most common shapes you'll encounter:

```
{batch_size, features}                    Tabular data or network output
{batch_size, sequence_length, features}   Sequences (text, time series, game frames)
{batch_size, height, width, channels}     Images
```

The **batch dimension** (always first) is how many samples the network processes simultaneously. Processing samples in batches is more efficient than one at a time because modern hardware (GPUs especially) is optimized for parallel operations on large blocks of numbers.

Understanding shapes is critical for working with Edifice. When you see something like `{1, 60, 256}`, that means: 1 sample, 60 timesteps, 256 features per timestep. A Mamba model with `embed_size: 256` and `window_size: 60` expects exactly that input shape.

## Generalization: The Actual Goal

The point of training isn't to memorize the training data -- it's to learn patterns that apply to **new, unseen data**. This is called **generalization**, and it's the central challenge of machine learning.

Two failure modes:

```
Underfitting                      Overfitting
────────────                      ───────────
Network is too simple or          Network memorizes the training data
undertrained to capture           but fails on new data.
the underlying pattern.

Training loss: high               Training loss: very low
Test loss:     high               Test loss:     high

"Can't learn the pattern"         "Learned the noise, not the signal"
```

Think of it like studying for a test. Underfitting is not studying enough -- you don't know the material. Overfitting is memorizing specific practice problems without understanding the concepts -- you ace the practice test but fail the real one.
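
The two failure modes can be reproduced on a toy regression problem. The sketch below is plain Python with no neural network involved: a constant predictor stands in for an underfit model, a least-squares line for a well-matched one, and a nearest-neighbor lookup that memorizes the training set for an overfit one. The data-generating rule (`y = 2x` plus Gaussian noise), the seed, and all names are assumptions made for illustration.

```python
import random

rng = random.Random(0)


def make_data(n):
    """Noisy samples of y = 2x; the noise is what an overfit model memorizes."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return data


train, test = make_data(20), make_data(20)


def mse(model, data):
    """Mean squared error of a model over a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Underfit: ignore x and always predict the mean training target.
mean_y = sum(y for _, y in train) / len(train)
constant = lambda x: mean_y

# Well-matched: least-squares line through the training data.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
line = lambda x: mean_y + slope * (x - mean_x)

# Overfit: memorize the training set and return the nearest stored target.
memorizer = lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

for name, model in [("constant", constant), ("line", line), ("memorizer", memorizer)]:
    print(f"{name:10s} train={mse(model, train):7.3f} test={mse(model, test):7.3f}")
```

The memorizer's training loss is exactly zero because it returns each stored target verbatim, but on the test split it reproduces the noise of whichever training point happens to be nearest. Comparing the training and test columns for each model is the same check used to spot underfitting and overfitting in a real training run.
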
Techniques for fighting overfitting (called **regularization**) include:

- **Dropout**: randomly zeroing out neurons during training, forcing the network to not rely on any single neuron
- **Weight decay**: penalizing large weights, encouraging simpler solutions
- **Early stopping**: stop training when performance on a held-out validation set starts to degrade
- **Data augmentation**: artificially expanding training data through transformations

## Why Architecture Matters

If all neural networks do the same basic thing (forward pass, loss, gradient descent), why do we need so many different architectures?

Because **structure encodes assumptions about the data**. The right architecture builds in the right biases for your problem:

```
Data Type    Key Property             Architecture Bias
─────────    ────────────             ─────────────────
Images       Spatial locality         Convolutions: share filters across positions
Sequences    Temporal ordering        Recurrence or attention: model dependencies over time
Graphs       Relational structure     Message passing: aggregate information from neighbors
Sets         Permutation invariance   Symmetric aggregation: order doesn't matter
```

A convolutional network "knows" that a cat's ear looks the same regardless of where it appears in an image. A recurrent network "knows" that word order matters. A graph network "knows" that nodes interact through edges. These structural biases mean the network needs less data and less training to learn the pattern, because the architecture already encodes part of the answer.

This is why Edifice has 19 families -- each family encodes a different set of assumptions about what the data looks like and how it should be processed.

## What's Next

With these foundations in place, you're ready for:

1. **[Core Vocabulary](core_vocabulary.md)** -- the precise terminology used across all Edifice guides
2. **[Problem Landscape](problem_landscape.md)** -- how different ML problems map to different architecture families
3. **[Reading Edifice](reading_edifice.md)** -- understanding the code patterns in this library
4. **[Learning Path](learning_path.md)** -- a guided tour through the 19 architecture families