Chapter 3

Course Home

Tensors — Organizing Numbers for Parallel Computation

Chapter 2's autograd engine tracked every scalar operation individually — one graph node per +, *, or max(0, x). That's correct but slow. Real frameworks work at tensor level: one operation on a batch of 1024 numbers is a single graph node. This chapter introduces the ndarray crate and rewrites your autograd engine to operate on whole arrays at once. The speedup is dramatic — and you'll see why every ML framework is built on tensor operations.

The Scalar Bottleneck

In Chapter 2, every number was a Value node: an Rc<RefCell> with data, grad, a backward closure, and a vector of children. For a small network with 100 weights and 1000 training examples, that's 100,000+ graph nodes per forward pass.

The runtime cost is real:

📊 The fix: batch everything

Instead of tracking 1000 individual loss values, we compute the mean squared error over the whole batch in one shot. One graph node per operation — not one per training example. This is the same optimization that makes GPU training fast: GPUs have thousands of cores designed for exactly this kind of parallel array math.

What Is a Tensor?

A tensor is just a multi-dimensional array. You already know the variants:

The key idea: a tensor has a shape (the size of each dimension) and a data type (f64, f32, etc.). All elements are stored in a single contiguous block of memory, laid out according to a stride pattern.

ndarray basics

use ndarray::Array2;

// Create a 3×2 matrix
let mut a: Array2<f64> = Array2::zeros((3, 2));
a[[0, 0]] = 1.0;
a[[1, 1]] = 2.0;
// a = [[1.0, 0.0],
//      [0.0, 2.0],
//      [0.0, 0.0]]

// Matrix multiplication
let b = Array2::ones((2, 4));
let c = a.dot(&b); // shape: (3, 4)

// Element-wise operations
let d = &a + &a; // each element added
let e = a * 2.0; // broadcast scalar multiply

The ndarray crate gives us NumPy-like array operations in Rust. It uses BLAS under the hood for matrix multiplication (through the blas-src feature), and it has a rich API for slicing, broadcasting, and reshaping.

Broadcasting — The Secret Sauce

What happens when you add a vector to a matrix? Consider:

use ndarray::Array2;

// 3 rows, 2 columns
let mut matrix: Array2<f64> = Array2::zeros((3, 2));
matrix += 1.0; // scalar broadcast: add 1.0 to every element

// Row broadcast: add a different bias to each column
let biases = arr1(&[0.1, 0.2]); // shape (2,)
// matrix + biases broadcasts along rows:
// each row gets bias[0] added to col 0, bias[1] added to col 1

Broadcasting is the rule for applying an operation between arrays of different shapes. The smaller array is "stretched" to match the larger one along dimension 1. This is how we add a bias vector to every row of a matrix without writing a loop.

The rules:

  1. 1.Right-align the shapes: (3, 2) and (2,) → align on the right → (3, 2) and (1, 2)
  2. 2.For each dimension, sizes match or one is 1: 3 vs 1 → broadcast, 2 vs 2 → match
  3. 3.If a dimension is missing from the smaller array, it's treated as 1

💡 No copies are made

Broadcasting doesn't actually duplicate memory. ndarray uses stride tricks to simulate the expanded shape — the bias vector is still stored as 2 f64 values, but the operation acts as if it's a matrix. This is memory-efficient and fast.

Rewriting Autograd with ndarray

Instead of wrapping every f64 in a Value, we package all our network parameters into tensor-valued parameters. A weight matrix W of shape (input_size, output_size) is a single parameter with a single gradient of the same shape.

Here's the core idea — a Param struct that holds a tensor and its gradient:

src/param.rs — tensor-level parameter

use ndarray::Array2;

pub struct Param {
    pub data: Array2<f64>,
    pub grad: Array2<f64>,
}

impl Param {
    pub fn new(shape: (usize, usize)) -> Self {
        // Initialize with small random values
        let data = Array2::from_shape_fn(shape, |_| rand_f64() * 0.1);
        let grad = Array2::zeros(shape);
        Param { data, grad }
    }

    pub fn zero_grad(&mut self) {
        self.grad.fill(0.0);
    }

    pub fn update(&mut self, lr: f64) {
        // self.data -= lr * self.grad   (element-wise)
        self.data = &self.data - &(self.grad * lr);
    }
}

fn rand_f64() -> f64 {
    // Simple LCG for reproducibility — no external deps for this snippet
    use std::time::{SystemTime, UNIX_EPOCH};
    static mut SEED: u64 = 0;
    unsafe {
        if SEED == 0 {
            SEED = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap()
                .as_nanos() as u64;
        }
        SEED = SEED.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (SEED >> 33) as f64 / (1u64 << 31) as f64 * 0.2 - 0.1
    }
}

📐 Key difference from Chapter 2

In Chapter 2, each parameter was a Value node and gradient computation happened per-node via backward closures. Here, the gradient for the entire weight matrix is computed as a single matrix operation: grad = input.T @ loss_grad. No graph, no closures, no per-element iteration. Just linear algebra.

But wait — doesn't this mean we're back to manual gradient formulas? Yes and no. We're writing layer-level gradient formulas instead of scalar-level ones. The savings come from:

Training Loop: Batch Gradient Descent

Here's the full training loop for our sine-wave fitter, now operating on the entire dataset at once:

examples/03-tensors.rs — full training loop

use ndarray::{Array1, Array2};

fn main() {
    // Generate synthetic data: y = sin(x) + noise
    let n = 1000;
    let xs: Array1<f64> = Array1::linspace(-3.0, 3.0, n);
    let ys: Array1<f64> = xs.mapv(|x| (x * std::f64::consts::PI).sin())
        + Array1::from_shape_fn(n, |_| rand_f64() * 0.1);

    // Network: 1 input -> 64 hidden (ReLU) -> 1 output
    let mut w1 = Param::new((1, 64));
    let mut b1 = Param::new((1, 64));
    let mut w2 = Param::new((64, 1));
    let mut b2 = Param::new((1, 1));

    // Reshape input to (n, 1) matrix
    let x_mat = xs.into_shape((n, 1)).unwrap();
    let y_mat = ys.into_shape((n, 1)).unwrap();

    let lr = 0.01;
    let steps = 2000;

    for step in 0..steps {
        // --- Forward pass ---
        // z1 = x @ w1 + b1  (shape: n × 64)
        let z1 = x_mat.dot(&w1.data) + &b1.data;
        // a1 = ReLU(z1)
        let a1 = z1.mapv(|v| if v > 0.0 { v } else { 0.0 });
        // y_pred = a1 @ w2 + b2  (shape: n × 1)
        let y_pred = a1.dot(&w2.data) + &b2.data;

        // --- Loss: MSE = mean((y_pred - y)^2) ---
        let diff = &y_pred - &y_mat;
        let loss = diff.mapv(|v| v * v).mean().unwrap();

        if step % 200 == 0 {
            println!("step {step:4}: loss = {loss:.6}");
        }

        // --- Backward pass (manual, layer-level) ---
        // dL/dy_pred = 2 * (y_pred - y) / n
        let n_f64 = n as f64;
        let mut grad_y_pred = &diff * (2.0 / n_f64);

        // dL/dw2 = a1.T @ grad_y_pred
        // dL/db2 = sum(grad_y_pred, axis=0)
        let a1_t = a1.t().to_owned();
        w2.grad = a1_t.dot(&grad_y_pred);
        b2.grad = grad_y_pred.sum_axis(ndarray::Axis(0)).into_shape((1, 64)).unwrap();

        // dL/da1 = grad_y_pred @ w2.T
        let mut grad_a1 = grad_y_pred.dot(&w2.data.t());

        // dL/dz1 = dL/da1 * ReLU'(z1)
        // ReLU'(z1) = 1 where z1 > 0, else 0
        let relu_mask = z1.mapv(|v| if v > 0.0 { 1.0 } else { 0.0 });
        grad_a1 = grad_a1 * relu_mask;

        // dL/dw1 = x.T @ grad_a1
        let x_t = x_mat.t().to_owned();
        w1.grad = x_t.dot(&grad_a1);
        // dL/db1 = sum(grad_a1, axis=0)
        b1.grad = grad_a1.sum_axis(ndarray::Axis(0)).into_shape((1, 64)).unwrap();

        // --- Update ---
        // Move the line above into a helper to be concise
        let update = |p: &mut Param| p.update(lr);

        // Manually inline for clarity (same as w1.data -= lr * w1.grad, etc.)
        w1.data = &w1.data - &(w1.grad * lr);
        w2.data = &w2.data - &(w2.grad * lr);
        b1.data = &b1.data - &(b1.grad * lr);
        b2.data = &b2.data - &(b2.grad * lr);
    }

    println!("Done! Final predictions ready.");
}

⚠️ Manual gradients — but now at layer level

We've traded scalar-level autograd for manual gradient formulas at the matrix level. This is exactly what every ML framework does internally — even PyTorch's autograd ultimately calls these same matrix gradient operations. The difference is that PyTorch records which layer operations happened and calls the right gradient kernel automatically. We're writing the kernels by hand, but the scope is dramatically reduced: one formula per layer type, not one per scalar.

Measuring the Speedup

Run the Chapter 2 autograd engine on 1000 data points. Time it. Now run this tensor-based version. The difference should be 10–100×, even on your 2015 iMac. Here's why:

⏱️ Try it yourself

Add std::time::Instant::now() timing to both versions. Run on 100, 1000, and 10,000 points. Plot the scaling curves. The per-element autograd should scale O(n²) or worse; the tensor version should scale near O(n). This is the difference between "educational" and "practical" ML code.

The Bridge: This Is Your Last "From Scratch" Chapter

The syllabus calls this out explicitly. After Chapter 3:

🎯 What you own after this chapter

You understand the entire stack from individual arithmetic operations to batched matrix computations. When Chapter 4 introduces a softmax or a cross-entropy loss, you'll think "that's just a sequence of array operations" rather than "that's scary math." This is the confidence that "from scratch" learning gives you.

Key Takeaways