Chapter 1 — From Line to Curve: The Leap to Nonlinearity

Why Lines Aren't Enough

Last chapter, we fit a line to linear data. It worked beautifully. But what if the data rises and falls like a noisy sine wave?

A line goes up or goes down. It can't go up, then down, then up again. The data has a curve (a sine wave). No matter how hard gradient descent tries, a line will always miss the pattern. The loss will plateau at a high value.

⚠️ This is called "underfitting"

When your model is too simple to capture the structure in your data, it underfits. The loss stays high regardless of training. Adding nonlinearity is how we fix this.
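To put a number on underfitting, here is a minimal sketch that finds the best possible line for noise-free sine samples on the same [0, 4π) grid used later in this chapter. It uses closed-form least squares instead of gradient descent, which is a substitution made here for brevity: it gives the lowest MSE any line can reach. The names `best_line` and `sine_samples` are illustrative, not part of the chapter's program.

```rust
/// Fit y = w*x + b by ordinary least squares and return (w, b, mse).
fn best_line(xs: &[f64], ys: &[f64]) -> (f64, f64, f64) {
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    // Slope = covariance(x, y) / variance(x)
    let mut cov = 0.0;
    let mut var = 0.0;
    for (x, y) in xs.iter().zip(ys) {
        cov += (x - mean_x) * (y - mean_y);
        var += (x - mean_x) * (x - mean_x);
    }
    let w = cov / var;
    let b = mean_y - w * mean_x;

    // Mean squared error of the fitted line
    let mse = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| {
            let e = w * x + b - y;
            e * e
        })
        .sum::<f64>()
        / n;
    (w, b, mse)
}

/// Noise-free samples of sin(x) on [0, 4π).
fn sine_samples(n: usize) -> (Vec<f64>, Vec<f64>) {
    let xs: Vec<f64> = (0..n)
        .map(|i| (i as f64 / n as f64) * 4.0 * std::f64::consts::PI)
        .collect();
    let ys: Vec<f64> = xs.iter().map(|x| x.sin()).collect();
    (xs, ys)
}
```

Calling `best_line` on `sine_samples(200)` should give a slightly negative slope and an MSE a little above 0.4. A flat line at zero would score 0.5, so the best line barely improves on predicting nothing at all.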

ReLU — The Simplest Curve Builder

The simplest nonlinear function that actually works is called ReLU (Rectified Linear Unit):

fn relu(x: f64) -> f64 {
    if x > 0.0 { x } else { 0.0 }
}

That's it. If the input is positive, pass it through. If negative, clamp to zero. It's a "bent" line — linear on one side, flat on the other.

Alone, ReLU is just a bend. But here's the magic: a sum of enough shifted and scaled ReLUs can approximate any continuous curve on a bounded interval as closely as you like. Each ReLU contributes one corner where the slope changes. Combine a few and you get a bump; stack enough of them and you get a sine wave, a parabola, or anything in between.

💡 The "LEGO brick" view

Think of ReLU as a LEGO brick. One brick adds a single corner. Two bricks at different positions build a ramp that levels off, and a third turns it into a bump. A thousand bricks build a cathedral. Deep learning is just scaling up this same stacking process.
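The stacking idea can be made concrete with a small sketch. In this construction (one of many possible), three shifted and scaled ReLUs add up to a triangular bump: the first starts the rise, the second reverses the slope, the third flattens it back to zero. The function name `bump` is illustrative.

```rust
fn relu(x: f64) -> f64 {
    if x > 0.0 { x } else { 0.0 }
}

/// A triangular bump on [0, 2] with its peak of 1.0 at x = 1,
/// built from three shifted and scaled ReLUs.
fn bump(x: f64) -> f64 {
    relu(x)                 // slope +1 from x = 0
        - 2.0 * relu(x - 1.0) // slope becomes -1 from x = 1
        + relu(x - 2.0)       // slope becomes 0 from x = 2
}
```

Outside [0, 2] the three terms cancel exactly, so bumps like this can be placed side by side and scaled independently to trace out a curve.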

A 2-Layer Network

Here's the architecture. We'll call it a "2-layer network" because there are two parameterised layers between input and output:

// Layer 1:  x → W1 * x + b1 → ReLU → h    (hidden layer, 32 neurons)
// Layer 2:  h → W2 * h + b2 → ŷ             (output layer, 1 neuron)

// Forward pass
fn forward(x: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let n_hidden = w1.len();

    // Hidden layer: ReLU(W1 * x + b1) for each neuron
    let mut h = vec![0.0; n_hidden];
    for i in 0..n_hidden {
        let z = w1[i] * x + b1[i];
        h[i] = if z > 0.0 { z } else { 0.0 }; // ReLU
    }

    // Output layer: weighted sum of hidden activations
    let mut y_hat = b2;
    for i in 0..n_hidden {
        y_hat += w2[i] * h[i];
    }
    y_hat
}

The total number of knobs: 32 (w1) + 32 (b1) + 32 (w2) + 1 (b2) = 97. Our line-fitter had 2 knobs. We've gone from 2 to 97 — and that's a tiny network.
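The same arithmetic can be written as a small helper, which makes it easy to see how the count grows with layer width. This is a sketch for a network with one hidden layer; `param_count` and its argument names are illustrative.

```rust
/// Number of trainable parameters in an n_in -> n_hidden -> n_out
/// network with one bias per neuron.
fn param_count(n_in: usize, n_hidden: usize, n_out: usize) -> usize {
    let layer1 = n_in * n_hidden + n_hidden; // W1 and b1
    let layer2 = n_hidden * n_out + n_out;   // W2 and b2
    layer1 + layer2
}
```

For this chapter's network, `param_count(1, 32, 1)` gives 97. Doubling the hidden layer to 64 neurons gives 193.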

Putting It Together — Fit a Sine Wave

The full program: generate sine wave data, build a 2-layer network with 32 hidden neurons, and train it with gradient descent — computing gradients manually for each of the 97 parameters.

src/main.rs — training loop

fn main() {
    // --- Generate sine wave data ---
    let n = 200;
    let mut xs: Vec<f64> = Vec::with_capacity(n);
    let mut ys: Vec<f64> = Vec::with_capacity(n);
    let mut rng = XorShift64::new(42);

    for i in 0..n {
        let x = (i as f64 / n as f64) * 4.0 * std::f64::consts::PI; // [0, 4π)
        let noise = (rng.next_f64() - 0.5) * 0.3;
        let y = x.sin() + noise;
        xs.push(x);
        ys.push(y);
    }

    // --- Initialise parameters (97 knobs) ---
    let n_hidden = 32;
    let mut w1: Vec<f64> = (0..n_hidden).map(|_| rng.next_f64() * 2.0 - 1.0).collect();
    let mut b1: Vec<f64> = (0..n_hidden).map(|_| rng.next_f64() * 2.0 - 1.0).collect();
    let mut w2: Vec<f64> = (0..n_hidden).map(|_| rng.next_f64() * 2.0 - 1.0).collect();
    let mut b2: f64 = 0.0;

    let lr = 0.01;

    // --- Training ---
    for step in 0..5000 {
        // Gradients
        let mut dw1 = vec![0.0; n_hidden];
        let mut db1 = vec![0.0; n_hidden];
        let mut dw2 = vec![0.0; n_hidden];
        let mut db2 = 0.0;

        for i in 0..n {
            let x = xs[i];
            let y_true = ys[i];

            // Forward pass (also store h for backward)
            let mut h = vec![0.0; n_hidden];
            let mut zs = vec![0.0; n_hidden];
            for j in 0..n_hidden {
                zs[j] = w1[j] * x + b1[j];
                h[j] = if zs[j] > 0.0 { zs[j] } else { 0.0 }; // ReLU
            }
            let y_pred = w2.iter().zip(&h).map(|(w, h)| w * h).sum::<f64>() + b2;

            let error = y_pred - y_true;

            // Gradients for output layer
            db2 += 2.0 * error;
            for j in 0..n_hidden {
                dw2[j] += 2.0 * error * h[j];
            }

            // Gradients for hidden layer (backprop through ReLU)
            for j in 0..n_hidden {
                let relu_grad = if zs[j] > 0.0 { 1.0 } else { 0.0 };
                let dhidden = 2.0 * error * w2[j] * relu_grad;
                dw1[j] += dhidden * x;
                db1[j] += dhidden;
            }
        }

        // Average and update
        let nf = n as f64;
        for j in 0..n_hidden {
            w1[j] -= lr * dw1[j] / nf;
            b1[j] -= lr * db1[j] / nf;
            w2[j] -= lr * dw2[j] / nf;
        }
        b2 -= lr * db2 / nf;

        // Print loss every 500 steps
        if step % 500 == 0 {
            let loss = (0..n).map(|i| {
                let y_pred = forward(xs[i], &w1, &b1, &w2, b2);
                let e = y_pred - ys[i];
                e * e
            }).sum::<f64>() / n as f64;
            println!("step {:>4} | MSE: {:.6}", step, loss);
        }
    }
}
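
The listing calls `XorShift64::new(42)` and `rng.next_f64()` without showing their definitions. The sketch below is one plausible implementation, assuming `next_f64` is meant to return a uniform value in [0, 1). It uses the standard xorshift64 shifts of 13, 7 and 17; the original program's generator may differ, so the exact numbers it prints may not match.

```rust
/// Minimal xorshift64 pseudo-random generator (assumed implementation).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The state must never be zero, or every output would be zero.
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        XorShift64 { state }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}
```

A hand-rolled generator keeps the program free of external crates and makes runs reproducible from the seed, which is presumably why the listing uses one.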

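When you run it, you should see the MSE drop from around 0.5 (a bad sine fit, about what a flat line at zero scores) to about 0.01 (a good sine fit). Exact values vary with the random initialisation. The added noise is uniform in [-0.15, 0.15], which has variance 0.3²/12 = 0.0075, so an MSE near 0.01 means the network has captured almost everything except the noise. It learns the oscillation pattern despite never being told "this is a sine wave."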
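Hand-written gradients are easy to get subtly wrong, so it helps to compare them against a numerical estimate. The sketch below does this for a single training example and a single weight `w1[j]`: it computes the gradient with the same chain-rule terms as the training loop, then estimates it with a central difference. The helper names `grad_w1` and `numeric_grad_w1` are illustrative and not part of the original program.

```rust
/// Forward pass of the 1 -> hidden -> 1 ReLU network from this chapter.
fn forward(x: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let mut y_hat = b2;
    for i in 0..w1.len() {
        let z = w1[i] * x + b1[i];
        y_hat += w2[i] * z.max(0.0); // ReLU, then weighted sum
    }
    y_hat
}

/// Analytic gradient of the squared error (y_pred - y)^2 with respect to w1[j].
fn grad_w1(j: usize, x: f64, y: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let error = forward(x, w1, b1, w2, b2) - y;
    let z = w1[j] * x + b1[j];
    let relu_grad = if z > 0.0 { 1.0 } else { 0.0 };
    2.0 * error * w2[j] * relu_grad * x
}

/// Central-difference estimate of the same gradient.
fn numeric_grad_w1(j: usize, x: f64, y: f64, w1: &[f64], b1: &[f64], w2: &[f64], b2: f64) -> f64 {
    let eps = 1e-6;
    let loss = |w: &[f64]| {
        let e = forward(x, w, b1, w2, b2) - y;
        e * e
    };
    let mut plus = w1.to_vec();
    let mut minus = w1.to_vec();
    plus[j] += eps;
    minus[j] -= eps;
    (loss(&plus) - loss(&minus)) / (2.0 * eps)
}
```

If the two values disagree by more than a small tolerance for parameters whose pre-activation is not right at zero, the backward pass has a bug. The same pattern extends to `b1`, `w2` and `b2`.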

What "Deep" Actually Means

Our network has one hidden layer. That's "shallow." Deep means many hidden layers stacked:

// Shallow:    x → hidden → ŷ
// Deep:       x → hidden₁ → hidden₂ → hidden₃ → ... → hidden₁₆ → ŷ

Why does depth matter? Each layer builds on the previous one's representations. In an image model, for example, the first layer might detect edges, the second might combine edges into shapes, and the third might combine shapes into objects. This hierarchy of abstraction is a large part of what makes deep learning so powerful, and it emerges from the training process rather than being programmed in by hand.
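In code, going deeper mostly means looping over layers. The sketch below generalises this chapter's forward pass to any number of fully connected layers, with vectors in place of the single input. The `Layer` struct and `forward_deep` function are illustrative names, and training is not shown.

```rust
/// One fully connected layer: weights[o][i] connects input i to output o.
struct Layer {
    weights: Vec<Vec<f64>>,
    biases: Vec<f64>,
}

impl Layer {
    /// Weighted sum plus bias for every output neuron (no activation yet).
    fn apply(&self, input: &[f64]) -> Vec<f64> {
        self.weights
            .iter()
            .zip(&self.biases)
            .map(|(row, b)| row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>() + b)
            .collect()
    }
}

/// Forward pass through a stack of layers. ReLU follows every layer
/// except the last, which stays linear like the output layer above.
fn forward_deep(layers: &[Layer], x: &[f64]) -> Vec<f64> {
    let mut activation = x.to_vec();
    for (idx, layer) in layers.iter().enumerate() {
        activation = layer.apply(&activation);
        if idx + 1 < layers.len() {
            for a in activation.iter_mut() {
                *a = a.max(0.0);
            }
        }
    }
    activation
}
```

With two layers this reduces to the shallow network from earlier. For instance, a hidden layer with weights 1 and -1 followed by an output layer with weights 1 and 1 computes the absolute value of its input.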

📐 The Universal Approximation Theorem

A single hidden layer with enough neurons can approximate any continuous function on a bounded interval arbitrarily well. That's the theory; it says nothing about how many neurons you need or whether gradient descent will find the right weights. In practice, deep networks (many layers, fewer neurons per layer) often learn more efficiently and generalise better: they build hierarchical representations instead of one giant flat lookup table.
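The theorem can be illustrated (not proved) by building the weights by hand instead of training them. The sketch below places one ReLU at each of `n` evenly spaced knots and sets its output weight to the change in slope needed there, which yields a piecewise-linear interpolant of the target function. In the chapter's notation this corresponds to `w1 = 1`, `b1 = -knot` and `w2 = coeff`. All function names are illustrative.

```rust
/// Build a one-hidden-layer ReLU network that linearly interpolates `f`
/// on [lo, hi] using `n` hidden neurons. Returns (knots, coeffs, offset).
fn relu_interpolant(f: fn(f64) -> f64, lo: f64, hi: f64, n: usize) -> (Vec<f64>, Vec<f64>, f64) {
    let step = (hi - lo) / n as f64;
    let knots: Vec<f64> = (0..n).map(|k| lo + k as f64 * step).collect();
    let mut coeffs = Vec::with_capacity(n);
    let mut prev_slope = 0.0;
    for &k in &knots {
        let slope = (f(k + step) - f(k)) / step;
        coeffs.push(slope - prev_slope); // change in slope at this knot
        prev_slope = slope;
    }
    (knots, coeffs, f(lo))
}

/// Evaluate offset + sum of coeff * ReLU(x - knot).
fn eval(knots: &[f64], coeffs: &[f64], offset: f64, x: f64) -> f64 {
    offset + knots.iter().zip(coeffs).map(|(k, c)| c * (x - k).max(0.0)).sum::<f64>()
}

/// Largest absolute error on a grid of 1001 points across [lo, hi].
fn max_error(f: fn(f64) -> f64, lo: f64, hi: f64, n: usize) -> f64 {
    let (knots, coeffs, offset) = relu_interpolant(f, lo, hi, n);
    (0..=1000)
        .map(|i| lo + (hi - lo) * i as f64 / 1000.0)
        .map(|x| (eval(&knots, &coeffs, offset, x) - f(x)).abs())
        .fold(0.0, f64::max)
}
```

Comparing `max_error(f64::sin, 0.0, 4.0 * std::f64::consts::PI, 8)` with the same call for 64 neurons shows the error shrinking as neurons are added, which is the "enough neurons" part of the theorem in action.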

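The "deep" in deep learning just means many layers of weighted-sum-and-nonlinearity stacked on top of each other. You just built the shallow version. Systems like GPT-4 and AlphaFold add other building blocks such as attention, but the core recipe of layered weighted sums, nonlinearities, and gradient-based training is the same.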

Key Takeaways

→ Lines can't fit curves. To model oscillating or nonlinear data, you need nonlinear functions.
→ ReLU is the simplest useful nonlinearity: max(0, x). It bends, which is enough.
→ A hidden layer of ReLU neurons lets the network learn multiple "bumps" and combine them into complex curves.
→ "Deep" means many layers. Each layer builds on the last, creating hierarchical abstractions.
→ You now have a working neural network. It has 97 knobs instead of 2. You wrote every line.