Chapter 3

Tensors — Organizing Numbers for Parallel Computation

Chapter 2's autograd engine tracked every scalar operation individually — one graph node per +, *, or max(0, x). That's correct but slow. Real frameworks work at tensor level: one operation on a batch of 1024 numbers is a single graph node. This chapter introduces the ndarray crate and rewrites your autograd engine to operate on whole arrays at once. The speedup is dramatic — and you'll see why every ML framework is built on tensor operations.

The Scalar Bottleneck

In Chapter 2, every number was a Value node: an Rc<RefCell> with data, grad, a backward closure, and a vector of children. For a small network with 100 weights and 1000 training examples, that's 100,000+ graph nodes per forward pass.

The runtime cost is real:

→Memory: Each Rc<RefCell> is a heap allocation with reference counting overhead. 100K nodes = ~10 MB of graph metadata alone.
→Instruction overhead: For every scalar multiply, Rust calls borrow(), reads data, allocates a new Rc, stores a closure, links children — dozens of instructions for one f64 * f64.
→No parallelism: Scalar operations process one number at a time. Your CPU's SIMD vector units sit idle.

📊 The fix: batch everything

Instead of tracking 1000 individual loss values, we compute the mean squared error over the whole batch in one shot. One graph node per operation — not one per training example. This is the same optimization that makes GPU training fast: GPUs have thousands of cores designed for exactly this kind of parallel array math.

What Is a Tensor?

A tensor is just a multi-dimensional array. You already know the variants:

→Scalar (0-D tensor): a single number, like f64
→Vector (1-D tensor): a 1D array, like Vec<f64>
→Matrix (2-D tensor): a grid of numbers, rows × columns
→3-D tensor: a stack of matrices — think of a batch of 32 RGB images, each 28×28 pixels: shape [32, 28, 28, 3]

The key idea: a tensor has a shape (the size of each dimension) and a data type (f64, f32, etc.). All elements are stored in a single contiguous block of memory, laid out according to a stride pattern.

ndarray basics

use ndarray::Array2;

// Create a 3×2 matrix
let mut a: Array2<f64> = Array2::zeros((3, 2));
a[[0, 0]] = 1.0;
a[[1, 1]] = 2.0;
// a = [[1.0, 0.0],
//      [0.0, 2.0],
//      [0.0, 0.0]]

// Matrix multiplication
let b = Array2::ones((2, 4));
let c = a.dot(&b); // shape: (3, 4)

// Element-wise operations
let d = &a + &a; // each element added
let e = a * 2.0; // broadcast scalar multiply

The ndarray crate gives us NumPy-like array operations in Rust. It uses BLAS under the hood for matrix multiplication (through the blas-src feature), and it has a rich API for slicing, broadcasting, and reshaping.

Broadcasting — The Secret Sauce

What happens when you add a vector to a matrix? Consider:

use ndarray::Array2;

// 3 rows, 2 columns
let mut matrix: Array2<f64> = Array2::zeros((3, 2));
matrix += 1.0; // scalar broadcast: add 1.0 to every element

// Row broadcast: add a different bias to each column
let biases = arr1(&[0.1, 0.2]); // shape (2,)
// matrix + biases broadcasts along rows:
// each row gets bias[0] added to col 0, bias[1] added to col 1

Broadcasting is the rule for applying an operation between arrays of different shapes. The smaller array is "stretched" to match the larger one along dimension 1. This is how we add a bias vector to every row of a matrix without writing a loop.

The rules:

1.Right-align the shapes: (3, 2) and (2,) → align on the right → (3, 2) and (1, 2)
2.For each dimension, sizes match or one is 1: 3 vs 1 → broadcast, 2 vs 2 → match
3.If a dimension is missing from the smaller array, it's treated as 1

💡 No copies are made

Broadcasting doesn't actually duplicate memory. ndarray uses stride tricks to simulate the expanded shape — the bias vector is still stored as 2 f64 values, but the operation acts as if it's a matrix. This is memory-efficient and fast.

Rewriting Autograd with `ndarray`

Instead of wrapping every f64 in a Value, we package all our network parameters into tensor-valued parameters. A weight matrix W of shape (input_size, output_size) is a single parameter with a single gradient of the same shape.

Here's the core idea — a Param struct that holds a tensor and its gradient:

src/param.rs — tensor-level parameter

use ndarray::Array2;

pub struct Param {
    pub data: Array2<f64>,
    pub grad: Array2<f64>,
}

impl Param {
    pub fn new(shape: (usize, usize)) -> Self {
        // Initialize with small random values
        let data = Array2::from_shape_fn(shape, |_| rand_f64() * 0.1);
        let grad = Array2::zeros(shape);
        Param { data, grad }
    }

    pub fn zero_grad(&mut self) {
        self.grad.fill(0.0);
    }

    pub fn update(&mut self, lr: f64) {
        // self.data -= lr * self.grad   (element-wise)
        self.data = &self.data - &(self.grad * lr);
    }
}

fn rand_f64() -> f64 {
    // Simple LCG for reproducibility — no external deps for this snippet
    use std::time::{SystemTime, UNIX_EPOCH};
    static mut SEED: u64 = 0;
    unsafe {
        if SEED == 0 {
            SEED = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap()
                .as_nanos() as u64;
        }
        SEED = SEED.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (SEED >> 33) as f64 / (1u64 << 31) as f64 * 0.2 - 0.1
    }
}

📐 Key difference from Chapter 2

In Chapter 2, each parameter was a Value node and gradient computation happened per-node via backward closures. Here, the gradient for the entire weight matrix is computed as a single matrix operation: grad = input.T @ loss_grad. No graph, no closures, no per-element iteration. Just linear algebra.

But wait — doesn't this mean we're back to manual gradient formulas? Yes and no. We're writing layer-level gradient formulas instead of scalar-level ones. The savings come from:

→No graph overhead. One Param per weight matrix, not one Value per scalar.
→Batched computation. A forward pass processes all 1000 examples at once.
→BLAS-accelerated matmul. ndarray::dot calls highly optimized linear algebra routines.

Training Loop: Batch Gradient Descent

Here's the full training loop for our sine-wave fitter, now operating on the entire dataset at once:

examples/03-tensors.rs — full training loop

use ndarray::{Array1, Array2};

fn main() {
    // Generate synthetic data: y = sin(x) + noise
    let n = 1000;
    let xs: Array1<f64> = Array1::linspace(-3.0, 3.0, n);
    let ys: Array1<f64> = xs.mapv(|x| (x * std::f64::consts::PI).sin())
        + Array1::from_shape_fn(n, |_| rand_f64() * 0.1);

    // Network: 1 input -> 64 hidden (ReLU) -> 1 output
    let mut w1 = Param::new((1, 64));
    let mut b1 = Param::new((1, 64));
    let mut w2 = Param::new((64, 1));
    let mut b2 = Param::new((1, 1));

    // Reshape input to (n, 1) matrix
    let x_mat = xs.into_shape((n, 1)).unwrap();
    let y_mat = ys.into_shape((n, 1)).unwrap();

    let lr = 0.01;
    let steps = 2000;

    for step in 0..steps {
        // --- Forward pass ---
        // z1 = x @ w1 + b1  (shape: n × 64)
        let z1 = x_mat.dot(&w1.data) + &b1.data;
        // a1 = ReLU(z1)
        let a1 = z1.mapv(|v| if v > 0.0 { v } else { 0.0 });
        // y_pred = a1 @ w2 + b2  (shape: n × 1)
        let y_pred = a1.dot(&w2.data) + &b2.data;

        // --- Loss: MSE = mean((y_pred - y)^2) ---
        let diff = &y_pred - &y_mat;
        let loss = diff.mapv(|v| v * v).mean().unwrap();

        if step % 200 == 0 {
            println!("step {step:4}: loss = {loss:.6}");
        }

        // --- Backward pass (manual, layer-level) ---
        // dL/dy_pred = 2 * (y_pred - y) / n
        let n_f64 = n as f64;
        let mut grad_y_pred = &diff * (2.0 / n_f64);

        // dL/dw2 = a1.T @ grad_y_pred
        // dL/db2 = sum(grad_y_pred, axis=0)
        let a1_t = a1.t().to_owned();
        w2.grad = a1_t.dot(&grad_y_pred);
        b2.grad = grad_y_pred.sum_axis(ndarray::Axis(0)).into_shape((1, 64)).unwrap();

        // dL/da1 = grad_y_pred @ w2.T
        let mut grad_a1 = grad_y_pred.dot(&w2.data.t());

        // dL/dz1 = dL/da1 * ReLU'(z1)
        // ReLU'(z1) = 1 where z1 > 0, else 0
        let relu_mask = z1.mapv(|v| if v > 0.0 { 1.0 } else { 0.0 });
        grad_a1 = grad_a1 * relu_mask;

        // dL/dw1 = x.T @ grad_a1
        let x_t = x_mat.t().to_owned();
        w1.grad = x_t.dot(&grad_a1);
        // dL/db1 = sum(grad_a1, axis=0)
        b1.grad = grad_a1.sum_axis(ndarray::Axis(0)).into_shape((1, 64)).unwrap();

        // --- Update ---
        // Move the line above into a helper to be concise
        let update = |p: &mut Param| p.update(lr);

        // Manually inline for clarity (same as w1.data -= lr * w1.grad, etc.)
        w1.data = &w1.data - &(w1.grad * lr);
        w2.data = &w2.data - &(w2.grad * lr);
        b1.data = &b1.data - &(b1.grad * lr);
        b2.data = &b2.data - &(b2.grad * lr);
    }

    println!("Done! Final predictions ready.");
}

⚠️ Manual gradients — but now at layer level

We've traded scalar-level autograd for manual gradient formulas at the matrix level. This is exactly what every ML framework does internally — even PyTorch's autograd ultimately calls these same matrix gradient operations. The difference is that PyTorch records which layer operations happened and calls the right gradient kernel automatically. We're writing the kernels by hand, but the scope is dramatically reduced: one formula per layer type, not one per scalar.

Measuring the Speedup

Run the Chapter 2 autograd engine on 1000 data points. Time it. Now run this tensor-based version. The difference should be 10–100×, even on your 2015 iMac. Here's why:

1.No per-element allocations. The Chapter 2 autograd creates a new Rc<RefCell> for every single arithmetic operation. With 1000 examples × 64 hidden units × 2 layers, that's ~130,000 heap allocations per forward pass. The tensor version: ~10 allocations total.
2.Vectorized math. ndarray operations use SIMD instructions where available. Your CPU can process 4 f64 values with one instruction instead of four separate ones.
3.BLAS-optimized matmul. Matrix multiplication is the most optimized function in all of computing. ndarray delegates to BLAS libraries (OpenBLAS, Intel MKL) that use cache-tiling, prefetching, and assembly-level tuning.

⏱️ Try it yourself

Add std::time::Instant::now() timing to both versions. Run on 100, 1000, and 10,000 points. Plot the scaling curves. The per-element autograd should scale O(n²) or worse; the tensor version should scale near O(n). This is the difference between "educational" and "practical" ML code.

The Bridge: This Is Your Last "From Scratch" Chapter

The syllabus calls this out explicitly. After Chapter 3:

→You've built the foundations: linear regression (Ch 0), multi-layer networks with manual gradients (Ch 1), scalar autograd (Ch 2), tensor-based training (Ch 3).
→You know what libraries do under the hood: ndarray::dot calls BLAS. Array2::mapv applies element-wise operations in a tight loop. Broadcasting is a stride trick. None of this is magic anymore.
→Starting in Phase 1 (Chapter 4), you'll use libraries for real. You'll build a character-level language model, train it on real text, and generate samples — all using ndarray. You won't reimplement autograd; you'll call .dot() and trust it.

🎯 What you own after this chapter

You understand the entire stack from individual arithmetic operations to batched matrix computations. When Chapter 4 introduces a softmax or a cross-entropy loss, you'll think "that's just a sequence of array operations" rather than "that's scary math." This is the confidence that "from scratch" learning gives you.

Key Takeaways

→ A tensor is a multi-dimensional array. Scalars (0-D), vectors (1-D), matrices (2-D), and higher-dimensional arrays are all tensors.
→ Broadcasting lets you operate on arrays of different shapes without copying data. It's a stride trick, not a memory duplication.
→ Batch computation replaces 1000 scalar operations with one matrix operation. This is 10–100× faster and is the core insight behind GPU acceleration.
→ Manual gradients at layer level replace scalar-level autograd. Fewer nodes, less overhead, more speed. Frameworks automate this with per-layer gradient kernels.
→ This is the end of "from scratch." From Chapter 4 onward, you use ndarray as your tensor library — but you know exactly what it does under the hood.
→ Measure the speedup. Time both versions. The difference is why real ML doesn't use scalar autograd.