Micrograd Demo Walkthrough


The goal

The demo teaches backpropagation by training a small scalar MLP classifier on a two-dimensional toy dataset. It mirrors the official micrograd workflow while keeping all data generation, loss computation, training, and graph inspection in pure Elixir.

Scalar autodiff warmup

The notebook starts with scalar values:

alias MicrogradEx.Value

x = Value.new(-4.0, label: "x")
y = x |> Value.mul(x) |> Value.relu()

gradients = Value.backward(y)
Value.grad(x, gradients)

Each scalar operation creates a new Value and records a small local derivative edge. Value.backward/1 walks the graph in reverse and returns a Gradients table.
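To make the reverse walk concrete, here is a minimal sketch that differentiates the same expression, y = relu(x * x), by hand with plain floats. The ScalarSketch module and its functions are hypothetical and are not part of MicrogradEx; they only illustrate how local derivatives combine through the chain rule, with a finite-difference estimate as an independent check.

```elixir
defmodule ScalarSketch do
  # Hypothetical helper, not part of MicrogradEx.
  def relu(v) when v > 0, do: v
  def relu(_v), do: 0.0

  def forward(x), do: relu(x * x)

  def grad(x) do
    z = x * x
    # Local derivative of relu at z: 1 when z > 0, otherwise 0.
    dy_dz = if z > 0, do: 1.0, else: 0.0
    # x feeds the product twice, so the two edges contribute x + x.
    dz_dx = x + x
    # Chain rule: multiply the local derivatives along the path.
    dy_dz * dz_dx
  end

  # Central finite difference, used only as an independent check.
  def numeric_grad(x, eps \\ 1.0e-6) do
    (forward(x + eps) - forward(x - eps)) / (2 * eps)
  end
end

x = -4.0
IO.inspect(ScalarSketch.forward(x), label: "y")
IO.inspect(ScalarSketch.grad(x), label: "dy/dx")
IO.inspect(ScalarSketch.numeric_grad(x), label: "numeric dy/dx")
```

At x = -4.0 the forward value is 16.0 and the analytic gradient is 2x = -8.0; the numeric estimate should agree to several decimal places.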

The two-moons dataset

The official Python notebook uses sklearn's make_moons. MicrogradEx uses MicrogradEx.Datasets.moons/2, a deterministic pure-Elixir generator with the same educational role.

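Labels are -1.0 and 1.0 because the max-margin loss multiplies each label by its score (y_i * score_i). The product is positive only when the score has the same sign as the label.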
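For intuition about the shape of the data, the following is a rough, noise-free sketch of a two-moons generator that follows the same geometry as sklearn's make_moons. The MoonsSketch module is hypothetical; MicrogradEx.Datasets.moons/2 may use different options, ordering, or noise handling.

```elixir
defmodule MoonsSketch do
  # Hypothetical generator, not the MicrogradEx implementation.
  # Returns a list of {{x, y}, label} tuples with labels -1.0 and 1.0.
  def moons(n_per_class) when n_per_class >= 2 do
    # Evenly spaced angles from 0 to pi keep the output deterministic.
    angles = for i <- 0..(n_per_class - 1), do: :math.pi() * i / (n_per_class - 1)

    # Upper half circle centred at the origin, labelled -1.0.
    upper = for t <- angles, do: {{:math.cos(t), :math.sin(t)}, -1.0}

    # Lower half circle shifted right and up so the moons interleave, labelled 1.0.
    lower = for t <- angles, do: {{1.0 - :math.cos(t), 0.5 - :math.sin(t)}, 1.0}

    upper ++ lower
  end
end

dataset = MoonsSketch.moons(50)
IO.inspect(length(dataset), label: "points")
```

Noise is left out of this sketch so that repeated calls return identical points without any random seed.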

The MLP

The main model is:

alias MicrogradEx.NN.MLP

model = MLP.new(2, [16, 16, 1], seed: {1337, 1337, 1337})

This means:

  • 2 input values;
  • first hidden layer with 16 neurons;
  • second hidden layer with 16 neurons;
  • output layer with 1 neuron.

Hidden layers use ReLU. The final layer is linear.
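The forward pass through such a network can be sketched with plain floats. ForwardSketch below is a hypothetical module with hand-picked weights and a tiny 2-2-1 shape; it is not the MicrogradEx.NN.MLP implementation, and it only shows that hidden layers apply ReLU while the last layer stays linear.

```elixir
defmodule ForwardSketch do
  # Hypothetical representation: a neuron is {weights, bias},
  # a layer is a list of neurons, a network is a list of layers.
  def neuron({weights, bias}, inputs, nonlinear?) do
    sum =
      Enum.zip(weights, inputs)
      |> Enum.reduce(bias, fn {w, x}, acc -> acc + w * x end)

    # ReLU for hidden neurons, identity for the output neuron.
    if nonlinear?, do: max(sum, 0.0), else: sum
  end

  def layer(neurons, inputs, nonlinear?) do
    Enum.map(neurons, &neuron(&1, inputs, nonlinear?))
  end

  def mlp(layers, inputs) do
    last = length(layers) - 1

    layers
    |> Enum.with_index()
    |> Enum.reduce(inputs, fn {neurons, i}, acc -> layer(neurons, acc, i != last) end)
  end
end

layers = [
  # Hidden layer: 2 neurons, each with 2 weights and a bias.
  [{[1.0, -1.0], 0.0}, {[0.5, 0.5], -1.0}],
  # Output layer: 1 linear neuron.
  [{[2.0, -3.0], 0.5}]
]

IO.inspect(ForwardSketch.mlp(layers, [2.0, 1.0]), label: "score")
```

With these weights the hidden activations are 1.0 and 0.5, so the single output score is 2.0 * 1.0 - 3.0 * 0.5 + 0.5 = 1.0.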

Parameter count

The official demo shape has 337 parameters:

First layer: 16 * (2 + 1) = 48
Second layer: 16 * (16 + 1) = 272
Output layer: 1 * (16 + 1) = 17
Total: 337

The + 1 in each line counts one bias parameter per neuron; the other term inside the parentheses is the number of incoming weights.
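The same arithmetic can be checked for any layer layout with a short helper. ParamCount is a hypothetical module written for this walkthrough, not a MicrogradEx function; it assumes fully connected layers with one bias per neuron.

```elixir
defmodule ParamCount do
  # Hypothetical helper: parameters per layer for a dense MLP.
  def per_layer(n_inputs, layer_sizes) do
    # Pair each layer's fan-in with its neuron count:
    # [2, 16, 16, 1] zipped with [16, 16, 1].
    [n_inputs | layer_sizes]
    |> Enum.zip(layer_sizes)
    |> Enum.map(fn {fan_in, n_out} -> n_out * (fan_in + 1) end)
  end

  def total(n_inputs, layer_sizes), do: Enum.sum(per_layer(n_inputs, layer_sizes))
end

IO.inspect(ParamCount.per_layer(2, [16, 16, 1]), label: "per layer")
IO.inspect(ParamCount.total(2, [16, 16, 1]), label: "total")
```

For the demo shape this gives 48, 272, and 17 parameters per layer, for a total of 337.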

The max-margin loss

The classification score is the scalar model output. A positive score predicts class 1; a non-positive score predicts class -1.

The loss is:

loss_i = relu(1 - y_i * score_i)
data_loss = mean(loss_i)
reg_loss = alpha * sum(p * p)
total_loss = data_loss + reg_loss

In code this is MicrogradEx.Losses.max_margin/4.
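As a worked example of the formulas above, this sketch evaluates the loss on plain floats. LossSketch is hypothetical and does not reflect the signature of MicrogradEx.Losses.max_margin/4; it takes labels, scores, and parameters as simple lists.

```elixir
defmodule LossSketch do
  # Hinge term for one example: zero once the margin y * score reaches 1.
  def hinge(y, score), do: max(1.0 - y * score, 0.0)

  def total(labels, scores, params, alpha \\ 1.0e-4) do
    losses = Enum.zip(labels, scores) |> Enum.map(fn {y, s} -> hinge(y, s) end)
    data_loss = Enum.sum(losses) / length(losses)
    # L2 penalty: alpha times the sum of squared parameters.
    reg_loss = alpha * Enum.sum(Enum.map(params, &(&1 * &1)))
    %{data_loss: data_loss, reg_loss: reg_loss, total_loss: data_loss + reg_loss}
  end

  # Fraction of examples whose score sign matches the label.
  def accuracy(labels, scores) do
    hits = Enum.zip(labels, scores) |> Enum.count(fn {y, s} -> y > 0 == s > 0 end)
    hits / length(labels)
  end
end

labels = [1.0, -1.0, 1.0]
scores = [2.0, 0.5, 0.25]
IO.inspect(LossSketch.total(labels, scores, [1.0, -2.0]))
IO.inspect(LossSketch.accuracy(labels, scores), label: "accuracy")
```

Here the hinge terms are 0.0, 1.5, and 0.75, so data_loss is 0.75. The third example is classified correctly but still contributes loss because its margin of 0.25 is below 1.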

L2 regularization

The regularization term penalizes large parameters. In pseudocode, summing over every model parameter:

alpha * sum(p * p for p <- NN.parameters(model))

The default alpha is 1.0e-4, matching the official demo.

The training loop

Training is immutable: each step returns a new model instead of updating parameters in place:

gradients = Value.backward(total_loss)
next_model = NN.apply_gradients(model, gradients, learning_rate)

MicrogradEx.Trainer.train/3 runs this loop for 100 steps by default and records loss, data loss, regularization loss, accuracy, and learning rate.
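The shape of such a loop can be sketched with Enum.reduce/3, which threads the current model and a history list through every step. TrainSketch below is hypothetical: it uses a small linear model with hand-derived hinge and L2 gradients in place of Value.backward/1, a fixed learning rate, and its own function names, so it illustrates the immutable pattern rather than the behaviour of MicrogradEx.Trainer.

```elixir
defmodule TrainSketch do
  # Hypothetical linear model %{w: [float], b: float}; score = w . x + b.
  def score(%{w: w, b: b}, x) do
    Enum.zip(w, x) |> Enum.reduce(b, fn {wi, xi}, acc -> acc + wi * xi end)
  end

  def loss(model, data, alpha) do
    hinge = for {x, y} <- data, do: max(1.0 - y * score(model, x), 0.0)
    squares = Enum.map([model.b | model.w], &(&1 * &1))
    Enum.sum(hinge) / length(data) + alpha * Enum.sum(squares)
  end

  # Hand-derived subgradient of the loss above (stands in for autodiff).
  def gradients(model, data, alpha) do
    n = length(data)
    # Only examples with margin below 1 contribute to the data gradient.
    active = for {x, y} <- data, 1.0 - y * score(model, x) > 0, do: {x, y}

    gw =
      for {wi, i} <- Enum.with_index(model.w) do
        data_term = Enum.reduce(active, 0.0, fn {x, y}, acc -> acc - y * Enum.at(x, i) / n end)
        data_term + 2 * alpha * wi
      end

    gb = Enum.reduce(active, 0.0, fn {_x, y}, acc -> acc - y / n end) + 2 * alpha * model.b
    %{w: gw, b: gb}
  end

  # Returns a new model map; the input model is left untouched.
  def apply_gradients(model, grads, lr) do
    %{
      w: Enum.zip(model.w, grads.w) |> Enum.map(fn {wi, gi} -> wi - lr * gi end),
      b: model.b - lr * grads.b
    }
  end

  def train(model, data, steps, lr, alpha \\ 1.0e-4) do
    {final, history} =
      Enum.reduce(1..steps, {model, []}, fn step, {current, acc} ->
        next = apply_gradients(current, gradients(current, data, alpha), lr)
        {next, [%{step: step, loss: loss(next, data, alpha)} | acc]}
      end)

    %{final_model: final, history: Enum.reverse(history)}
  end
end

data = [{[2.0, 0.0], 1.0}, {[-2.0, 0.0], -1.0}, {[1.0, 1.0], 1.0}, {[-1.0, -1.0], -1.0}]
start = %{w: [0.0, 0.0], b: 0.0}
run = TrainSketch.train(start, data, 100, 0.1)
IO.inspect(List.last(run.history), label: "last step")
```

Because every step builds a fresh map, the starting model is still available after training, which makes it easy to compare runs or resume from any recorded point.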

Plotting loss and accuracy

MicrogradEx.PlotData converts training runs into plain rows:

PlotData.loss_history(run)
PlotData.accuracy_history(run)

The notebook renders those rows with Vega-Lite.

Decision boundary

The decision boundary is built by evaluating the trained model over a padded grid that covers the dataset, where h sets the grid spacing:

PlotData.decision_boundary(run.final_model, dataset, h: 0.25)

Every grid point is classified by score sign, then plotted behind the training data.
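The grid construction can be sketched as follows. GridSketch is a hypothetical module: the padding of 1.0, the row shape, and the score function passed in as an anonymous function are all assumptions made for illustration and are not taken from PlotData.decision_boundary.

```elixir
defmodule GridSketch do
  # Evenly spaced floats starting at lo, stepping by h, not exceeding hi.
  def axis(lo, hi, h) do
    count = trunc(Float.floor((hi - lo) / h))
    for i <- 0..count, do: lo + i * h
  end

  # points: list of {x, y}; score_fun: fn {x, y} -> float end.
  def decision_rows(points, score_fun, h, pad \\ 1.0) do
    {xs, ys} = Enum.unzip(points)

    for gx <- axis(Enum.min(xs) - pad, Enum.max(xs) + pad, h),
        gy <- axis(Enum.min(ys) - pad, Enum.max(ys) + pad, h) do
      # Classify by score sign: positive is class 1.0, otherwise -1.0.
      class = if score_fun.({gx, gy}) > 0, do: 1.0, else: -1.0
      %{x: gx, y: gy, class: class}
    end
  end
end

points = [{-1.0, 0.0}, {2.0, 0.5}]
rows = GridSketch.decision_rows(points, fn {x, y} -> x - y end, 0.25)
IO.inspect(length(rows), label: "grid points")
```

The number of grid points grows roughly with 1 / h squared, so a larger h trades boundary detail for speed.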

Graph inspection

The scalar graph is built during forward operations. MicrogradEx exposes it without mutation:

  • Graph.nodes/2 shows scalar node data and gradients;
  • Graph.edges/1 shows parent-to-child dependencies and local gradients;
  • Graph.to_dot/2 exports DOT text for optional Graphviz rendering.
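To show what DOT text for a small scalar graph can look like, here is a minimal sketch that turns lists of nodes and edges into a digraph. DotSketch and its input shapes are hypothetical; the real Graph.to_dot/2 output may include gradients, operation nodes, or different attributes.

```elixir
defmodule DotSketch do
  # nodes: list of {id, label}; edges: list of {from_id, to_id}.
  # Labels are assumed to contain no double quotes.
  def to_dot(nodes, edges) do
    node_lines = for {id, label} <- nodes, do: ~s(  #{id} [label="#{label}"];)
    edge_lines = for {from, to} <- edges, do: "  #{from} -> #{to};"
    Enum.join(["digraph G {"] ++ node_lines ++ edge_lines ++ ["}"], "\n")
  end
end

nodes = [{"x", "x = -4.0"}, {"z", "x * x = 16.0"}, {"y", "relu = 16.0"}]
edges = [{"x", "z"}, {"z", "y"}]
IO.puts(DotSketch.to_dot(nodes, edges))
```

The resulting text can be saved to a file and rendered with the Graphviz dot command if Graphviz is installed.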

What to try next

Change one variable at a time:

  • noise: 0.2;
  • MLP.new(2, [8, 8, 1]);
  • steps: 50;
  • alpha: 0.0;
  • h: 0.35 for a faster decision-boundary grid.

For a broader set of experiments, open notebooks/micrograd_extras.livemd. It compares datasets, model sizes, regularization, learning-rate schedules, decision-boundary resolution, and a spiral dataset challenge.