Chapter 1
A straight line can only fit straight-line data. To fit curves (sine waves, spirals, the shape of a cat) you need nonlinearity. This chapter introduces the single idea that makes deep learning work: a simple nonlinear function, stacked many times, can approximate any continuous curve.
Last chapter, we fit a line to linear data. It worked beautifully. But what if the data looks like this?
A line goes up or goes down. It can't go up, then down, then up again. The data has a curve (a sine wave). No matter how hard gradient descent tries, a line will always miss the pattern. The loss will plateau at a high value.
⚠️ This is called "underfitting"
When your model is too simple to capture the structure in your data, it underfits. The loss stays high regardless of training. Adding nonlinearity is how we fix this.
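To put a number on underfitting, here is a minimal sketch that fits the best possible line to sine data and reports its error. It uses closed-form least squares rather than the gradient descent from the last chapter, so the result is the lowest MSE any line can reach. The helper names `fit_line` and `sine_data` are made up for this illustration, and the data is noise-free to keep the arithmetic clean.

```rust
/// Fit y = w*x + b by ordinary least squares and return (w, b, mse).
fn fit_line(xs: &[f64], ys: &[f64]) -> (f64, f64, f64) {
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    // Slope = covariance(x, y) / variance(x)
    let mut cov = 0.0;
    let mut var = 0.0;
    for (x, y) in xs.iter().zip(ys) {
        cov += (x - mean_x) * (y - mean_y);
        var += (x - mean_x) * (x - mean_x);
    }
    let w = cov / var;
    let b = mean_y - w * mean_x;

    // Mean squared error of the fitted line
    let mse = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| {
            let e = w * x + b - y;
            e * e
        })
        .sum::<f64>()
        / n;
    (w, b, mse)
}

/// Noise-free sine samples on [0, 4π), the same x range used later in this chapter.
fn sine_data(n: usize) -> (Vec<f64>, Vec<f64>) {
    let xs: Vec<f64> = (0..n)
        .map(|i| (i as f64 / n as f64) * 4.0 * std::f64::consts::PI)
        .collect();
    let ys: Vec<f64> = xs.iter().map(|x| x.sin()).collect();
    (xs, ys)
}
```

Calling `fit_line` on `sine_data(200)` gives an MSE of roughly 0.42. Predicting a constant zero would score 0.5 (the variance of a sine wave), so even the best line captures almost none of the pattern.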
The simplest nonlinear function that actually works is called ReLU (Rectified Linear Unit):
fn relu(x: f64) -> f64 {
    if x > 0.0 { x } else { 0.0 }
}
That's it. If the input is positive, pass it through. If negative, clamp to zero. It's a "bent" line — linear on one side, flat on the other.
Alone, ReLU is just a bend. But here's the magic: a sum of many shifted and scaled ReLUs can approximate any continuous curve over a bounded range. Each ReLU contributes one "corner" where the slope changes, and two or three corners together make a "bump." Add up enough of them and you get a sine wave, a parabola, or anything in between.
💡 The "LEGO brick" view
Think of ReLU as a LEGO brick. One brick builds a straight step. Two bricks at different positions build a bump. A thousand bricks build a cathedral. Deep learning is just scaling up this same stacking process.
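The stacking idea can be made concrete with a short sketch. Three shifted, scaled ReLUs add up to a triangular bump, and a row of such bumps, each scaled by the target function's value at its centre, gives a piecewise-linear approximation of that function. The names `bump` and `approx` and the grid parameters are assumptions for illustration. A trained network finds its own corners rather than using a fixed grid like this.

```rust
fn relu(x: f64) -> f64 {
    if x > 0.0 { x } else { 0.0 }
}

/// Triangular bump built from three shifted, scaled ReLUs.
/// Zero for x <= 0, rises to 1 at x = 1, falls back to zero at x = 2.
fn bump(x: f64) -> f64 {
    relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)
}

/// Piecewise-linear approximation of `f` at `x`, using one bump per grid point.
/// Grid points sit at start, start + step, ..., and each bump peaks at its
/// own grid point and reaches zero at the two neighbouring grid points.
fn approx(f: fn(f64) -> f64, x: f64, start: f64, step: f64, n_knots: usize) -> f64 {
    (0..n_knots)
        .map(|k| {
            let center = start + k as f64 * step;
            f(center) * bump((x - center) / step + 1.0)
        })
        .sum()
}
```

For example, `approx(f64::sin, 1.05, 0.0, 0.1, 64)` lands within about 0.001 of `1.05_f64.sin()`. Halving `step` (and doubling `n_knots`) cuts the error by roughly a factor of four, which is the "add more bricks" effect in numbers.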
Here's the architecture. We'll call it a "2-layer network" because there are two parameterised layers between input and output:
// Layer 1: x → W1 * x + b1 → ReLU → h (hidden layer, 32 neurons)
// Layer 2: h → W2 * h + b2 → ŷ (output layer, 1 neuron)

// Forward pass
fn forward(x: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let n_hidden = w1.len();

    // Hidden layer: ReLU(W1 * x + b1) for each neuron
    let mut h = vec![0.0; n_hidden];
    for i in 0..n_hidden {
        let z = w1[i] * x + b1[i];
        h[i] = if z > 0.0 { z } else { 0.0 }; // ReLU
    }

    // Output layer: weighted sum of hidden activations
    let mut y_hat = b2;
    for i in 0..n_hidden {
        y_hat += w2[i] * h[i];
    }
    y_hat
}
The total number of knobs: 32 (w1) + 32 (b1) + 32 (w2) + 1 (b2) = 97. Our line-fitter had 2 knobs. We've gone from 2 to 97, and that's a tiny network.
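The same count can be written as a small function, which makes it easy to see how the number of knobs grows with the hidden layer. This is only a sketch for the one-input, one-output shape used in this chapter, and `param_count` is a name invented here.

```rust
/// Number of trainable parameters in a network with one scalar input,
/// `n_hidden` hidden neurons, and one scalar output.
fn param_count(n_hidden: usize) -> usize {
    let w1 = n_hidden; // one input weight per hidden neuron
    let b1 = n_hidden; // one bias per hidden neuron
    let w2 = n_hidden; // one output weight per hidden neuron
    let b2 = 1; // single output bias
    w1 + b1 + w2 + b2
}
```

With `n_hidden = 32` this returns 97, matching the tally above. The count is `3 * n_hidden + 1`, so it grows linearly with the width of the hidden layer.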
The full program: generate sine wave data, build a 2-layer network with 32 hidden neurons, and train it with gradient descent — computing gradients manually for each of the 97 parameters.
fn main() {
    // NOTE: XorShift64 is not defined in this listing. It is assumed to be a
    // small PRNG exposing new(seed) and next_f64(), with next_f64() in [0, 1).

    // --- Generate sine wave data ---
    let n = 200;
    let mut xs: Vec<f64> = Vec::with_capacity(n);
    let mut ys: Vec<f64> = Vec::with_capacity(n);
    let mut rng = XorShift64::new(42);
    for i in 0..n {
        let x = (i as f64 / n as f64) * 4.0 * std::f64::consts::PI; // [0, 4π)
        let noise = (rng.next_f64() - 0.5) * 0.3;
        let y = x.sin() + noise;
        xs.push(x);
        ys.push(y);
    }

    // --- Initialise parameters (97 knobs) ---
    let n_hidden = 32;
    let mut w1: Vec<f64> = (0..n_hidden).map(|_| rng.next_f64() * 2.0 - 1.0).collect();
    let mut b1: Vec<f64> = (0..n_hidden).map(|_| rng.next_f64() * 2.0 - 1.0).collect();
    let mut w2: Vec<f64> = (0..n_hidden).map(|_| rng.next_f64() * 2.0 - 1.0).collect();
    let mut b2: f64 = 0.0;
    let lr = 0.01;

    // --- Training ---
    for step in 0..5000 {
        // Gradients, accumulated over the whole dataset
        let mut dw1 = vec![0.0; n_hidden];
        let mut db1 = vec![0.0; n_hidden];
        let mut dw2 = vec![0.0; n_hidden];
        let mut db2 = 0.0;

        for i in 0..n {
            let x = xs[i];
            let y_true = ys[i];

            // Forward pass (also store h and zs for the backward pass)
            let mut h = vec![0.0; n_hidden];
            let mut zs = vec![0.0; n_hidden];
            for j in 0..n_hidden {
                zs[j] = w1[j] * x + b1[j];
                h[j] = if zs[j] > 0.0 { zs[j] } else { 0.0 }; // ReLU
            }
            let y_pred = w2.iter().zip(&h).map(|(w, h)| w * h).sum::<f64>() + b2;
            let error = y_pred - y_true;

            // Gradients for output layer
            db2 += 2.0 * error;
            for j in 0..n_hidden {
                dw2[j] += 2.0 * error * h[j];
            }

            // Gradients for hidden layer (backprop through ReLU)
            for j in 0..n_hidden {
                let relu_grad = if zs[j] > 0.0 { 1.0 } else { 0.0 };
                let dhidden = 2.0 * error * w2[j] * relu_grad;
                dw1[j] += dhidden * x;
                db1[j] += dhidden;
            }
        }

        // Average and update
        let nf = n as f64;
        for j in 0..n_hidden {
            w1[j] -= lr * dw1[j] / nf;
            b1[j] -= lr * db1[j] / nf;
            w2[j] -= lr * dw2[j] / nf;
        }
        b2 -= lr * db2 / nf;

        // Print loss every 500 steps
        if step % 500 == 0 {
            let loss = (0..n).map(|i| {
                let y_pred = forward(xs[i], &w1, &b1, &w2, b2);
                let e = y_pred - ys[i];
                e * e
            }).sum::<f64>() / n as f64;
            println!("step {:>4} | MSE: {:.6}", step, loss);
        }
    }
}
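The program calls `XorShift64::new(42)` and `rng.next_f64()` without showing where they come from. A minimal stand-in, assuming a standard 64-bit xorshift generator with the common 13/7/17 shift triple, might look like the sketch below. This is a hypothetical implementation, not necessarily the course's own, so the exact loss values it produces may differ from the ones quoted in this chapter.

```rust
/// Hypothetical stand-in for the `XorShift64` generator used in `main`.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from a zero state, or it stays at zero forever,
        // so a zero seed is swapped for an arbitrary non-zero constant.
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        XorShift64 { state }
    }

    /// Advance the generator and return the next 64-bit value.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Uniform value in [0, 1), built from the top 53 bits of the next output.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}
```

Placing this definition alongside `relu`, `forward`, and `main` in one file should be enough for the program to compile. The same seed always yields the same sequence, which keeps training runs reproducible.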
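When you run it, you should see the MSE drop from around 0.5 (a bad sine fit) toward about 0.01 (a good sine fit), though the exact numbers depend on the random initialisation. The noise added to the data is uniform on [-0.15, 0.15], so its variance (0.3² / 12 ≈ 0.0075) sets a floor: the MSE cannot go meaningfully below that without fitting the noise itself. The network learns the oscillation pattern despite never being told "this is a sine wave."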
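Hand-written gradients are easy to get subtly wrong, so a common sanity check is to compare them against a finite-difference estimate. The sketch below does this for one hidden weight on a single sample, using the same chain-rule formula as the training loop. The function names (`loss`, `analytic_dw1`, `numeric_dw1`) and the step size `eps` are choices made for this illustration.

```rust
/// Forward pass of the one-hidden-layer network from this chapter.
fn forward(x: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let mut y_hat = b2;
    for j in 0..w1.len() {
        let z = w1[j] * x + b1[j];
        y_hat += w2[j] * z.max(0.0); // ReLU, then weighted sum
    }
    y_hat
}

/// Squared error on a single (x, y) sample.
fn loss(x: f64, y: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let e = forward(x, w1, b1, w2, b2) - y;
    e * e
}

/// Analytic dLoss/dw1[j]: 2 * error * w2[j] * ReLU'(z_j) * x.
fn analytic_dw1(j: usize, x: f64, y: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let e = forward(x, w1, b1, w2, b2) - y;
    let z = w1[j] * x + b1[j];
    let relu_grad = if z > 0.0 { 1.0 } else { 0.0 };
    2.0 * e * w2[j] * relu_grad * x
}

/// Central finite-difference estimate of dLoss/dw1[j].
fn numeric_dw1(j: usize, x: f64, y: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let eps = 1e-6;
    let mut plus = w1.to_vec();
    let mut minus = w1.to_vec();
    plus[j] += eps;
    minus[j] -= eps;
    (loss(x, y, &plus, b1, w2, b2) - loss(x, y, &minus, b1, w2, b2)) / (2.0 * eps)
}
```

If the two functions agree to several decimal places for every `j`, the backward pass is very likely correct. The check can disagree when some `z` sits almost exactly at zero, because ReLU has a corner there and the finite difference straddles it.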
Our network has one hidden layer. That's "shallow." Deep means many hidden layers stacked:
// Shallow: x → hidden → ŷ
// Deep:    x → hidden₁ → hidden₂ → hidden₃ → ... → hidden₁₆ → ŷ
Why does depth matter? Each layer builds on the previous one's representations. The first layer might detect edges. The second layer combines edges into shapes. The third combines shapes into objects. This hierarchy of abstraction is what makes deep learning so powerful — and it emerges automatically from the training process.
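In code, going deep mostly means looping over layers instead of writing out a single hidden layer by hand. Here is a minimal sketch of a forward pass through an arbitrary stack of fully connected layers, with ReLU after every layer except the last. The `Layer` struct and `forward_deep` function are assumptions made for this example, and training such a stack would need the backward pass extended in the same way.

```rust
/// One fully connected layer: `weights[i][k]` connects input k to neuron i.
struct Layer {
    weights: Vec<Vec<f64>>,
    biases: Vec<f64>,
}

impl Layer {
    /// Weighted sum plus bias for every neuron, with no activation applied.
    fn affine(&self, input: &[f64]) -> Vec<f64> {
        self.weights
            .iter()
            .zip(&self.biases)
            .map(|(row, b)| row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>() + b)
            .collect()
    }
}

/// Forward pass through a stack of layers: ReLU after every layer except the last.
fn forward_deep(layers: &[Layer], input: &[f64]) -> Vec<f64> {
    let mut activations = input.to_vec();
    for (idx, layer) in layers.iter().enumerate() {
        let z = layer.affine(&activations);
        let is_last = idx + 1 == layers.len();
        activations = if is_last {
            z // output layer stays linear
        } else {
            z.into_iter().map(|v| v.max(0.0)).collect() // ReLU
        };
    }
    activations
}
```

The network from this chapter is the special case of two `Layer` values: a 32-neuron hidden layer fed by one input, followed by a single output neuron. Adding depth is then just a matter of pushing more layers into the slice.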
📐 The Universal Approximation Theorem
A single hidden layer with enough neurons can approximate any continuous function on a bounded domain arbitrarily well. That's the theory, and it says nothing about how many neurons "enough" is or how to find the weights. In practice, deep networks (many layers, fewer neurons per layer) often learn more efficiently and generalise better, because they build hierarchical representations instead of one giant flat lookup table.
The "deep" in deep learning just means "many layers of ReLU-and-weighted-sum stacked on top of each other." You just built the shallow version. Scale it up and you get everything from GPT-4 to AlphaFold.
ReLU is just max(0, x). It bends, which is enough.