Chapter 0 — What "Learning" Means in This Context

The Big Idea: Knobs and Error Signals

Imagine a dimmer switch connected to a light bulb. You want the light to be exactly 75% brightness. You turn the knob, check the brightness, turn again. Too dim? Turn up. Too bright? Turn down. Each adjustment gets you closer.

Machine learning is this same loop, but with thousands or billions of knobs and a more sophisticated "how wrong am I?" measurement. The knobs are called parameters. The "how wrong" measurement is called the loss. The adjustment rule is called gradient descent.

This chapter builds the simplest possible version of this loop: two knobs (slope and intercept), one error measurement (mean squared error), and a manual adjustment rule (gradient descent).

Your First Model: A Line

A line is the simplest function: y = slope × x + intercept. Two knobs. Given some data — 50 (x, y) points scattered near an unknown line — we want to find the slope and intercept that best fit.

The data has a pattern (roughly linear) plus noise. Our job: find the line that passes through the middle of the cloud.
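As a minimal sketch, the two-knob model fits in a single function. The name `predict` is an assumption made for illustration; the full listing later in this chapter inlines the same expression instead of calling a helper.

```rust
/// The whole model: a line with two knobs.
/// `predict` is a hypothetical helper name, not part of the chapter's listing.
fn predict(slope: f64, intercept: f64, x: f64) -> f64 {
    slope * x + intercept
}

fn main() {
    // Two different knob settings describe two different lines.
    let xs = [0.0, 5.0, 10.0];
    for &x in &xs {
        let flat = predict(0.0, 2.0, x); // horizontal line at y = 2
        let tilted = predict(0.5, 2.0, x); // slope 0.5, crossing the y-axis at 2
        println!("x = {:>4.1} | flat: {:.1} | tilted: {:.1}", x, flat, tilted);
    }
}
```

Learning, in this chapter's sense, is nothing more than searching for the `slope` and `intercept` arguments that make this function agree with the data.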

Measuring How Wrong We Are

For each point $(x, y)$, our model predicts $\hat{y} = \text{slope} \cdot x + \text{intercept}$. The error is the signed vertical gap $\hat{y} - y$ between the prediction and the true value. We square it (so positive and negative errors don't cancel) and average over all points:

💡 What is ŷ?

Pronounced "y-hat" (rhymes with "that"). The hat ($\hat{}$) means predicted value. The true value from your data is plain $y$. The model's output is $\hat{y}$. The difference $\hat{y} - y$ is the error that drives learning.

📐 Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
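To see the formula at work on numbers small enough to check by hand, here is a sketch with three made-up points. The function name `mse` and the data are assumptions for illustration; the chapter's listing has its own `mean_squared_error` that computes the same quantity.

```rust
/// Mean squared error of the line `slope * x + intercept` on paired data.
/// (Hypothetical helper for this example.)
fn mse(xs: &[f64], ys: &[f64], slope: f64, intercept: f64) -> f64 {
    assert_eq!(xs.len(), ys.len(), "xs and ys must be the same length");
    let total: f64 = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| {
            let error = slope * x + intercept - y; // ŷ - y
            error * error
        })
        .sum();
    total / xs.len() as f64
}

fn main() {
    // Three illustrative points, not taken from the chapter's data.
    let xs = [1.0, 2.0, 3.0];
    let ys = [3.0, 5.0, 6.0];

    // Line y = 2x: predictions 2, 4, 6; errors -1, -1, 0; MSE = (1 + 1 + 0) / 3.
    println!("MSE for y = 2x:         {:.4}", mse(&xs, &ys, 2.0, 0.0)); // 0.6667

    // Line y = 1.5x + 1.5: predictions 3, 4.5, 6; errors 0, -0.5, 0; MSE = 0.25 / 3.
    println!("MSE for y = 1.5x + 1.5: {:.4}", mse(&xs, &ys, 1.5, 1.5)); // 0.0833
}
```

The second line has the lower MSE, so by this measure it is the better fit for these three points.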

This is our "how wrong" number. Lower = better. Zero = perfect fit (which on noisy data usually signals overfitting; we'll get there).

"Wait," you might ask, "why square? Why not absolute value?" Squaring does two useful things. It keeps the loss smooth: the absolute value has a sharp corner at zero error, where its derivative is undefined, while the square is differentiable everywhere (we need derivatives soon). It also penalizes large errors more than small ones: a point that's 10 units off contributes 100× the loss of a point that's 1 unit off.

Gradient Descent — Turning the Knobs

Now we have our loss number. We want to make it smaller. The question: for each knob (slope, intercept), which direction should we turn it, and by how much?

The answer: compute the gradient — the partial derivative of the loss with respect to each parameter. The gradient points uphill (toward higher loss). We go the opposite direction (downhill).

💡 Intuition, not calculus

Think of standing on a hill in fog. You can't see the bottom, but you can feel which way the ground slopes beneath your feet. Take a step downhill. Repeat. That's gradient descent. The gradient is just "which way is down right here."
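"Feeling the slope under your feet" can be made literal: nudge one knob a tiny amount in each direction and see how the loss moves. The sketch below does this with a central finite difference. The helper names, the step `h = 1e-5`, and the three data points are assumptions for illustration; the chapter's training loop uses an exact formula rather than this numeric probe.

```rust
/// Mean squared error of the line on paired data (hypothetical helper).
fn mse(xs: &[f64], ys: &[f64], slope: f64, intercept: f64) -> f64 {
    let total: f64 = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| (slope * x + intercept - y).powi(2))
        .sum();
    total / xs.len() as f64
}

/// Estimate how the loss changes per unit of slope by wiggling the slope by ±h.
fn feel_slope(xs: &[f64], ys: &[f64], slope: f64, intercept: f64) -> f64 {
    let h = 1e-5; // assumed wiggle size
    (mse(xs, ys, slope + h, intercept) - mse(xs, ys, slope - h, intercept)) / (2.0 * h)
}

fn main() {
    let xs = [1.0, 2.0, 3.0];
    let ys = [3.0, 5.0, 6.0];

    for &slope in &[0.0, 5.0] {
        let g = feel_slope(&xs, &ys, slope, 0.0);
        // A negative value means the loss falls as the slope grows, so turn the knob up.
        let advice = if g < 0.0 { "turn slope up" } else { "turn slope down" };
        println!("slope = {:.1} | loss change per unit slope = {:>8.3} | {}", slope, g, advice);
    }
}
```

Starting from a slope of 0 the probe is negative (the line is too flat, so increase the slope); starting from 5 it is positive (too steep, so decrease it). The sign says which way is downhill.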

The update rule for each parameter:

slope     -= learning_rate * slope_gradient
intercept -= learning_rate * intercept_gradient

📐 Notation: what $d(\text{loss})/d(\text{param})$ means

$d(\text{loss})/d(\text{slope})$ is the slope_gradient — "how much does the loss change when we wiggle the slope by a tiny amount?"
$d(\text{loss})/d(\text{intercept})$ is the intercept_gradient — same question for the intercept.

The $d$ stands for "an infinitesimal change." Read $d(\text{MSE})/d(\text{param})$ aloud as "d-MSE by d-param". The code comment later says "multiply by 2/n to get $d(\text{MSE})/d(\text{param})$" — it's telling you that after that step, slope_grad holds exactly $d(\text{MSE})/d(\text{slope})$.
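For readers who want to see where the 2/n comes from, here is a short derivation sketch using the chain rule. It assumes only the model $\hat{y}_i = \text{slope} \cdot x_i + \text{intercept}$ and the MSE formula given earlier; each derivative treats the other knob as fixed.

$$\frac{d(\text{MSE})}{d(\text{slope})} = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \cdot \frac{d\hat{y}_i}{d(\text{slope})} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i$$

$$\frac{d(\text{MSE})}{d(\text{intercept})} = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \cdot \frac{d\hat{y}_i}{d(\text{intercept})} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

The last step uses $d\hat{y}_i/d(\text{slope}) = x_i$ and $d\hat{y}_i/d(\text{intercept}) = 1$. The sums are what the training loop accumulates point by point; the factor 2/n is applied once at the end.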

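The learning rate controls step size. Too big: you overshoot the minimum, and the loss can bounce around or even grow. Too small: each step barely moves the knobs, so reaching the minimum takes impractically many steps.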
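The effect of step size is easy to reproduce on a toy problem. The sketch below runs the same gradient descent loop with three learning rates on five noise-free points. Everything here is an assumption for illustration: the helper name `final_loss`, the data, the starting point (0, 0), the 200 steps, and the three rates, which were picked for this tiny dataset and do not transfer to other data.

```rust
/// Run `steps` of gradient descent from slope = 0, intercept = 0
/// and return the MSE at the end. (Hypothetical helper.)
fn final_loss(xs: &[f64], ys: &[f64], learning_rate: f64, steps: usize) -> f64 {
    let n = xs.len() as f64;
    let (mut slope, mut intercept) = (0.0_f64, 0.0_f64);

    for _ in 0..steps {
        let (mut slope_grad, mut intercept_grad) = (0.0, 0.0);
        for (x, y) in xs.iter().zip(ys) {
            let error = slope * x + intercept - y; // ŷ - y
            slope_grad += error * x;
            intercept_grad += error;
        }
        // Same update rule as in the text, with the 2/n factor folded in.
        slope -= learning_rate * slope_grad * 2.0 / n;
        intercept -= learning_rate * intercept_grad * 2.0 / n;
    }

    let total: f64 = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| (slope * x + intercept - y).powi(2))
        .sum();
    total / n
}

fn main() {
    // Five points exactly on y = 0.8x + 2.1, so the best possible MSE is 0.
    let xs = [0.0, 1.0, 2.0, 3.0, 4.0];
    let ys: Vec<f64> = xs.iter().map(|x| 0.8 * x + 2.1).collect();

    for &lr in &[0.0005, 0.05, 0.5] {
        let loss = final_loss(&xs, &ys, lr, 200);
        println!("learning rate {:<6} | MSE after 200 steps: {:.3e}", lr, loss);
    }
}
```

On this data the middle rate should end very close to zero, the smallest rate should still be far from it, and the largest should produce an enormous loss because every step overshoots further than the last.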

The Rust Code — Pure `std`

Zero crates. Zero dependencies. The code defines its own minimal PRNG (XorShift64) at the bottom of the file, uses only built-in and std types (f64, u64, Vec), and needs nothing beyond basic arithmetic plus the XOR and bit-shift operations inside the PRNG.

src/main.rs — full file

fn main() {
    // --- Generate synthetic data ---
    // True line: y = 0.8 * x + 2.1 + noise
    let true_slope = 0.8;
    let true_intercept = 2.1;
    let n = 50;

    let mut xs: Vec<f64> = Vec::with_capacity(n);
    let mut ys: Vec<f64> = Vec::with_capacity(n);

    // Simple XorShift PRNG, defined at the bottom of this file (no external crate needed)
    let mut rng = XorShift64::new(42);
    for _ in 0..n {
        let x = rng.next_f64() * 10.0; // x in [0, 10)
        let noise = (rng.next_f64() - 0.5) * 3.0; // noise in [-1.5, 1.5)
        let y = true_slope * x + true_intercept + noise;
        xs.push(x);
        ys.push(y);
    }

    // --- Initialize knobs ---
    let mut slope = rng.next_f64() * 2.0 - 1.0; // random start
    let mut intercept = rng.next_f64() * 5.0 - 2.0; // random start
    let learning_rate = 0.01;

    // --- Training loop ---
    for step in 0..1000 {
        // Compute gradients (average over all points)
        let mut slope_grad = 0.0;
        let mut intercept_grad = 0.0;

        for i in 0..n {
            let prediction = slope * xs[i] + intercept;
            let error = prediction - ys[i];
            // Accumulate the sum part of each gradient
            slope_grad += error * xs[i]; // Σ (ŷ - y) × x
            intercept_grad += error; // Σ (ŷ - y)
        }
        // Finish: multiply by 2/n to get d(MSE)/d(param)
        slope_grad *= 2.0 / n as f64;
        intercept_grad *= 2.0 / n as f64;

        // Update knobs (step downhill)
        slope -= learning_rate * slope_grad;
        intercept -= learning_rate * intercept_grad;

        // Print progress every 100 steps
        if step % 100 == 0 {
            let mse = mean_squared_error(&xs, &ys, slope, intercept);
            println!(
                "step {:>4} | slope: {:.4} | intercept: {:.4} | MSE: {:.4}",
                step, slope, intercept, mse
            );
        }
    }

    // --- Final report ---
    println!("\nTrue:       slope = {:.4}, intercept = {:.4}", true_slope, true_intercept);
    println!("Learned:    slope = {:.4}, intercept = {:.4}", slope, intercept);
    println!("Difference: slope = {:.4}, intercept = {:.4}",
        (slope - true_slope).abs(), (intercept - true_intercept).abs());
}

fn mean_squared_error(xs: &[f64], ys: &[f64], slope: f64, intercept: f64) -> f64 {
    let n = xs.len() as f64;
    let mut total = 0.0;
    for i in 0..xs.len() {
        let prediction = slope * xs[i] + intercept;
        let error = prediction - ys[i];
        total += error * error;
    }
    total / n
}

// --- Minimal PRNG (XorShift) ---
struct XorShift64(u64);

impl XorShift64 {
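    fn new(seed: u64) -> Self {
        // XorShift must not start at zero: a zero state stays zero forever.
        assert_ne!(seed, 0, "XorShift64 seed must be nonzero");
        Self(seed)
    }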

    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
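        // Normalize to [0, 1): keep the top 53 bits, which an f64 represents exactly.
        // Masking to 63 bits instead can round up to exactly 1.0 during the cast.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64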
    }
}

📐 Notation: [0, 1) — what the bracket types mean

This is interval notation from mathematics. [0, 1) means "0 up to but not including 1":

[0, 1] — closed: includes both 0 and 1
(0, 1) — open: excludes both 0 and 1
[0, 1) — half-open: includes 0, excludes 1

This notation is standard in both math and programming docs. Python's random.random() and Rust's rand crate both document their output range as [0, 1). It's a compact way to say "from zero to just under one."
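The difference between [0, 1) and [0, 1] is easy to blur in floating point, because converting a large integer to f64 rounds to the nearest representable value. As a sketch (both helper names are hypothetical), the check below compares two ways of scaling a 64-bit integer into the unit interval and looks at the worst-case input.

```rust
/// Scale using the top 53 bits. An f64 holds 53 significant bits exactly,
/// so the largest possible result is 1 - 2^-53, strictly below 1.
fn unit_from_53_bits(bits: u64) -> f64 {
    (bits >> 11) as f64 / (1u64 << 53) as f64
}

/// Scale using the low 63 bits. Values close to 2^63 do not fit in an f64
/// and round up to 2^63 during the cast, so the result can be exactly 1.0.
fn unit_from_63_bits(bits: u64) -> f64 {
    (bits & (i64::MAX as u64)) as f64 / (1u64 << 63) as f64
}

fn main() {
    let worst_case = u64::MAX; // every bit set
    println!("53-bit scaling stays below 1: {}", unit_from_53_bits(worst_case) < 1.0); // true
    println!("63-bit scaling stays below 1: {}", unit_from_63_bits(worst_case) < 1.0); // false
}
```

Only the 53-bit version honors the half-open promise for every input; with the 63-bit version the upper bound is reachable, though only for a tiny fraction of inputs.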

⚠️ Rolling your own RNG

A PRNG (Pseudo-Random Number Generator) produces numbers that look random but are actually deterministic — given the same starting seed, you get the same sequence every time. This is useful for reproducibility: your experiment produces the same results on every run.

LCG (Linear Congruential Generator) is one of the oldest and simplest PRNGs: next = (a × current + c) mod m. Simple but has known statistical flaws (e.g., low bits are not very random).

XorShift is a step up — it mixes the state with XOR and bit-shift operations. It's still tiny (just a few lines) but passes more statistical tests than LCG. We use it here to keep dependencies at zero.

In practice you'd use rand or fastrand. The point is: nothing here requires a crate. Every line is plain Rust you can read and understand.
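For comparison, the LCG formula from this note can be sketched in a few lines as well. The struct name `Lcg64` is hypothetical, and the multiplier and increment are the constants commonly attributed to Knuth's MMIX generator, chosen here only as a well-known example. The modulus is 2^64, which wrapping arithmetic on u64 gives for free.

```rust
/// Minimal LCG: next = (a * current + c) mod 2^64. (Hypothetical example type.)
struct Lcg64(u64);

impl Lcg64 {
    fn next_u64(&mut self) -> u64 {
        // wrapping_mul / wrapping_add discard overflow, which is the "mod 2^64".
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn main() {
    // Same seed, same sequence: this is the reproducibility property.
    let mut a = Lcg64(42);
    let mut b = Lcg64(42);
    for _ in 0..5 {
        assert_eq!(a.next_u64(), b.next_u64());
    }

    // Known flaw with a power-of-two modulus: the lowest bit simply alternates.
    let mut g = Lcg64(42);
    let low_bits: Vec<u64> = (0..8).map(|_| g.next_u64() & 1).collect();
    println!("{:?}", low_bits); // [1, 0, 1, 0, 1, 0, 1, 0]
}
```

Because both constants are odd, each step flips the parity of the state, so the lowest bit carries no randomness at all. That is the kind of statistical flaw the note refers to.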

What You'll See

Run it. The output should have this shape (treat the digits below as illustrative; your exact values depend on the PRNG stream):

$ cargo run -r
step    0 | slope: 0.4321 | intercept: 1.2345 | MSE: 12.4567
step  100 | slope: 0.6890 | intercept: 2.5678 | MSE: 0.8765
step  200 | slope: 0.7560 | intercept: 2.3456 | MSE: 0.6543
step  300 | slope: 0.7820 | intercept: 2.2345 | MSE: 0.5987
step  400 | slope: 0.7920 | intercept: 2.1890 | MSE: 0.5876
...
step  900 | slope: 0.7990 | intercept: 2.1234 | MSE: 0.5843

True:       slope = 0.8000, intercept = 2.1000
Learned:    slope = 0.7993, intercept = 2.1205
Difference: slope = 0.0007, intercept = 0.0205

Two things to notice:

1. The MSE drops fast at first, then slows down. That's normal: the gradient shrinks as you near the bottom of the valley, so each step gets smaller.
2. The learned parameters are close to the true values but not exact. The noise prevents perfect recovery: 50 noisy points only pin the line down approximately.
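One way to check the second point is to compare against the closed-form least-squares line, which is the exact minimizer of the MSE that gradient descent is approaching. The sketch below is an aside under stated assumptions: `least_squares_line` is a hypothetical helper, the five data points are made up for illustration, and the chapter's own code does not use this formula.

```rust
/// Closed-form least-squares fit of a line. Returns (slope, intercept).
/// (Hypothetical helper, not part of the chapter's listing.)
fn least_squares_line(xs: &[f64], ys: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    let mut covariance = 0.0; // Σ (x - mean_x)(y - mean_y)
    let mut variance = 0.0; // Σ (x - mean_x)²
    for (x, y) in xs.iter().zip(ys) {
        covariance += (x - mean_x) * (y - mean_y);
        variance += (x - mean_x) * (x - mean_x);
    }

    let slope = covariance / variance;
    (slope, mean_y - slope * mean_x)
}

fn main() {
    // Points from y = 0.8x + 2.1 with small hand-picked noise added.
    let xs = [0.0, 1.0, 2.0, 3.0, 4.0];
    let ys = [2.4, 2.7, 3.8, 4.1, 5.5];

    let (slope, intercept) = least_squares_line(&xs, &ys);
    println!("best fit: slope = {:.2}, intercept = {:.2}", slope, intercept);
    println!("true:     slope = 0.80, intercept = 2.10");
}
```

For these five points the best-fit line works out to a slope of 0.76 and an intercept of 2.18, not 0.8 and 2.1. Even a perfectly converged optimizer would land there, because the noise has moved the bottom of the valley away from the true line.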

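Try changing the learning rate to 0.1 or 0.001. Watch what happens. Too high and the loss never settles: with x values up to 10, a rate of 0.1 will most likely overshoot further on every step until the numbers overflow to inf or NaN. Too low and it crawls: at 0.001 the intercept will likely still be visibly off after 1000 steps. This is your first tuning experience, and it never stops being relevant.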

Why This Matters

Everything that follows is a more elaborate version of this same loop:

→ GPT-4 is rumored to have ~1.8 trillion knobs instead of 2 (OpenAI has not published a parameter count). The adjustment rule for models of that kind is more sophisticated (typically Adam or a variant instead of raw gradient descent). But the principle is identical.
→ Backpropagation (Chapter 2) is just a clever way to compute gradients when the function isn't a simple line but a deep composition of many operations.
→ Fine-tuning (Phase 2) starts with a model that already works and takes smaller steps so it doesn't forget what it learned.
→ Self-improvement (Phase 3) lets the model generate its own training data and check its own answers.

Every one of those builds on the simple loop you just wrote. When that line snaps into place after 100 iterations, that feeling — a shape you defined learning from data — is the same one you'll have 25 chapters from now. The knobs just get more numerous.

Key Takeaways

→ Machine learning is function approximation — you define a shape with knobs (parameters) and adjust them based on error signals.
→ Gradient descent is walking downhill in the dark: compute the slope of the loss with respect to each knob, step opposite the gradient.
→ Mean squared error measures "how wrong" — squared so positive/negative errors don't cancel, and large errors hurt more.
→ The learning rate is step size. Too big = overshoot. Too small = crawl. You'll tune this for the rest of your ML career.
→ You built it from scratch. No crate, no black box. You understand every line.