Chapter 3
Course Home
Chapter 2's autograd engine tracked every scalar operation individually —
one graph node per +, *, or max(0, x).
That's correct but slow. Real frameworks work at tensor level:
one operation on a batch of 1024 numbers is a single graph node. This chapter
introduces the ndarray crate and rewrites your autograd engine
to operate on whole arrays at once. The speedup is dramatic — and you'll see
why every ML framework is built on tensor operations.
In Chapter 2, every number was a Value node: an Rc<RefCell>
with data, grad, a backward closure, and
a vector of children. For a small network with 100 weights and 1000 training
examples, that's 100,000+ graph nodes per forward pass.
The runtime cost is real:
Rc<RefCell> is a heap allocation with reference counting overhead. 100K nodes = ~10 MB of graph metadata alone.borrow(), reads data, allocates a new Rc, stores a closure, links children — dozens of instructions for one f64 * f64.📊 The fix: batch everything
Instead of tracking 1000 individual loss values, we compute the mean squared error over the whole batch in one shot. One graph node per operation — not one per training example. This is the same optimization that makes GPU training fast: GPUs have thousands of cores designed for exactly this kind of parallel array math.
A tensor is just a multi-dimensional array. You already know the variants:
f64Vec<f64>[32, 28, 28, 3]
The key idea: a tensor has a shape (the size of each dimension)
and a data type (f64, f32, etc.).
All elements are stored in a single contiguous block of memory, laid out
according to a stride pattern.
use ndarray::Array2; // Create a 3×2 matrix let mut a: Array2<f64> = Array2::zeros((3, 2)); a[[0, 0]] = 1.0; a[[1, 1]] = 2.0; // a = [[1.0, 0.0], // [0.0, 2.0], // [0.0, 0.0]] // Matrix multiplication let b = Array2::ones((2, 4)); let c = a.dot(&b); // shape: (3, 4) // Element-wise operations let d = &a + &a; // each element added let e = a * 2.0; // broadcast scalar multiply
The ndarray crate gives us NumPy-like array operations in Rust.
It uses BLAS under the hood for matrix multiplication (through the
blas-src feature), and it has a rich API for slicing,
broadcasting, and reshaping.
What happens when you add a vector to a matrix? Consider:
use ndarray::Array2; // 3 rows, 2 columns let mut matrix: Array2<f64> = Array2::zeros((3, 2)); matrix += 1.0; // scalar broadcast: add 1.0 to every element // Row broadcast: add a different bias to each column let biases = arr1(&[0.1, 0.2]); // shape (2,) // matrix + biases broadcasts along rows: // each row gets bias[0] added to col 0, bias[1] added to col 1
Broadcasting is the rule for applying an operation between arrays of different shapes. The smaller array is "stretched" to match the larger one along dimension 1. This is how we add a bias vector to every row of a matrix without writing a loop.
The rules:
(3, 2) and (2,) → align on the right → (3, 2) and (1, 2)3 vs 1 → broadcast, 2 vs 2 → match💡 No copies are made
Broadcasting doesn't actually duplicate memory. ndarray uses
stride tricks to simulate the expanded shape — the bias vector is still
stored as 2 f64 values, but the operation acts as if it's
a matrix. This is memory-efficient and fast.
ndarray
Instead of wrapping every f64 in a Value, we
package all our network parameters into tensor-valued parameters.
A weight matrix W of shape (input_size, output_size)
is a single parameter with a single gradient of the same shape.
Here's the core idea — a Param struct that holds a tensor
and its gradient:
use ndarray::Array2;
pub struct Param {
pub data: Array2<f64>,
pub grad: Array2<f64>,
}
impl Param {
pub fn new(shape: (usize, usize)) -> Self {
// Initialize with small random values
let data = Array2::from_shape_fn(shape, |_| rand_f64() * 0.1);
let grad = Array2::zeros(shape);
Param { data, grad }
}
pub fn zero_grad(&mut self) {
self.grad.fill(0.0);
}
pub fn update(&mut self, lr: f64) {
// self.data -= lr * self.grad (element-wise)
self.data = &self.data - &(self.grad * lr);
}
}
fn rand_f64() -> f64 {
// Simple LCG for reproducibility — no external deps for this snippet
use std::time::{SystemTime, UNIX_EPOCH};
static mut SEED: u64 = 0;
unsafe {
if SEED == 0 {
SEED = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap()
.as_nanos() as u64;
}
SEED = SEED.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
(SEED >> 33) as f64 / (1u64 << 31) as f64 * 0.2 - 0.1
}
}
📐 Key difference from Chapter 2
In Chapter 2, each parameter was a Value node and gradient
computation happened per-node via backward closures. Here, the
gradient for the entire weight matrix is computed as a single
matrix operation: grad = input.T @ loss_grad. No graph,
no closures, no per-element iteration. Just linear algebra.
But wait — doesn't this mean we're back to manual gradient formulas? Yes and no. We're writing layer-level gradient formulas instead of scalar-level ones. The savings come from:
Param per weight matrix, not one Value per scalar.ndarray::dot calls highly optimized linear algebra routines.Here's the full training loop for our sine-wave fitter, now operating on the entire dataset at once:
use ndarray::{Array1, Array2};
fn main() {
// Generate synthetic data: y = sin(x) + noise
let n = 1000;
let xs: Array1<f64> = Array1::linspace(-3.0, 3.0, n);
let ys: Array1<f64> = xs.mapv(|x| (x * std::f64::consts::PI).sin())
+ Array1::from_shape_fn(n, |_| rand_f64() * 0.1);
// Network: 1 input -> 64 hidden (ReLU) -> 1 output
let mut w1 = Param::new((1, 64));
let mut b1 = Param::new((1, 64));
let mut w2 = Param::new((64, 1));
let mut b2 = Param::new((1, 1));
// Reshape input to (n, 1) matrix
let x_mat = xs.into_shape((n, 1)).unwrap();
let y_mat = ys.into_shape((n, 1)).unwrap();
let lr = 0.01;
let steps = 2000;
for step in 0..steps {
// --- Forward pass ---
// z1 = x @ w1 + b1 (shape: n × 64)
let z1 = x_mat.dot(&w1.data) + &b1.data;
// a1 = ReLU(z1)
let a1 = z1.mapv(|v| if v > 0.0 { v } else { 0.0 });
// y_pred = a1 @ w2 + b2 (shape: n × 1)
let y_pred = a1.dot(&w2.data) + &b2.data;
// --- Loss: MSE = mean((y_pred - y)^2) ---
let diff = &y_pred - &y_mat;
let loss = diff.mapv(|v| v * v).mean().unwrap();
if step % 200 == 0 {
println!("step {step:4}: loss = {loss:.6}");
}
// --- Backward pass (manual, layer-level) ---
// dL/dy_pred = 2 * (y_pred - y) / n
let n_f64 = n as f64;
let mut grad_y_pred = &diff * (2.0 / n_f64);
// dL/dw2 = a1.T @ grad_y_pred
// dL/db2 = sum(grad_y_pred, axis=0)
let a1_t = a1.t().to_owned();
w2.grad = a1_t.dot(&grad_y_pred);
b2.grad = grad_y_pred.sum_axis(ndarray::Axis(0)).into_shape((1, 64)).unwrap();
// dL/da1 = grad_y_pred @ w2.T
let mut grad_a1 = grad_y_pred.dot(&w2.data.t());
// dL/dz1 = dL/da1 * ReLU'(z1)
// ReLU'(z1) = 1 where z1 > 0, else 0
let relu_mask = z1.mapv(|v| if v > 0.0 { 1.0 } else { 0.0 });
grad_a1 = grad_a1 * relu_mask;
// dL/dw1 = x.T @ grad_a1
let x_t = x_mat.t().to_owned();
w1.grad = x_t.dot(&grad_a1);
// dL/db1 = sum(grad_a1, axis=0)
b1.grad = grad_a1.sum_axis(ndarray::Axis(0)).into_shape((1, 64)).unwrap();
// --- Update ---
// Move the line above into a helper to be concise
let update = |p: &mut Param| p.update(lr);
// Manually inline for clarity (same as w1.data -= lr * w1.grad, etc.)
w1.data = &w1.data - &(w1.grad * lr);
w2.data = &w2.data - &(w2.grad * lr);
b1.data = &b1.data - &(b1.grad * lr);
b2.data = &b2.data - &(b2.grad * lr);
}
println!("Done! Final predictions ready.");
}
⚠️ Manual gradients — but now at layer level
We've traded scalar-level autograd for manual gradient formulas at the matrix level. This is exactly what every ML framework does internally — even PyTorch's autograd ultimately calls these same matrix gradient operations. The difference is that PyTorch records which layer operations happened and calls the right gradient kernel automatically. We're writing the kernels by hand, but the scope is dramatically reduced: one formula per layer type, not one per scalar.
Run the Chapter 2 autograd engine on 1000 data points. Time it. Now run this tensor-based version. The difference should be 10–100×, even on your 2015 iMac. Here's why:
Rc<RefCell> for every single arithmetic operation. With 1000 examples × 64 hidden units × 2 layers, that's ~130,000 heap allocations per forward pass. The tensor version: ~10 allocations total.ndarray operations use SIMD instructions where available. Your CPU can process 4 f64 values with one instruction instead of four separate ones.ndarray delegates to BLAS libraries (OpenBLAS, Intel MKL) that use cache-tiling, prefetching, and assembly-level tuning.⏱️ Try it yourself
Add std::time::Instant::now() timing to both versions.
Run on 100, 1000, and 10,000 points. Plot the scaling curves. The
per-element autograd should scale O(n²) or worse; the tensor version
should scale near O(n). This is the difference between "educational"
and "practical" ML code.
The syllabus calls this out explicitly. After Chapter 3:
ndarray::dot calls BLAS. Array2::mapv applies element-wise operations in a tight loop. Broadcasting is a stride trick. None of this is magic anymore.ndarray. You won't reimplement autograd; you'll call .dot() and trust it.🎯 What you own after this chapter
You understand the entire stack from individual arithmetic operations to batched matrix computations. When Chapter 4 introduces a softmax or a cross-entropy loss, you'll think "that's just a sequence of array operations" rather than "that's scary math." This is the confidence that "from scratch" learning gives you.
ndarray as your tensor library — but you know exactly what it does under the hood.