Chapter 2
Course HomeIn Chapter 1, we computed gradients by hand for 97 parameters. That was tedious and error-prone. This chapter builds an autograd engine that tracks which operations affect which values, then automatically computes gradients by walking the graph backward. Backpropagation is bookkeeping, not calculus — once you've built this, you'll never be intimidated by it again.
In Chapter 1, we computed gradients for each of 97 parameters by hand. For each parameter, we had to derive its contribution to the loss and write the update code manually. This works for 97 parameters. It does not work for 1.8 trillion (GPT-4).
Worse, every time we change the network architecture — add a layer, change the activation function — we have to re-derive and re-write all the gradient equations. This is a productivity disaster.
⚠️ The key insight
Every computation in your network is a combination of simple operations: addition, multiplication, ReLU, etc. Each operation knows how to compute its own local gradient. If we record the sequence of operations (the computational graph), we can propagate gradients backward automatically — regardless of the network structure.
Value Struct
The core idea: instead of plain f64 values, we wrap every number
in a Value that tracks:
data — the actual number (the forward value)grad — the gradient of the loss with respect to this valuebackward — a closure that computes the local gradient and propagates it to prevprev — the child Values that produced this one (the graph edges)use std::cell::RefCell;
use std::rc::Rc;
pub struct Value {
pub data: f64,
pub grad: f64,
pub backward: Option<Box<dyn Fn()>>,
pub prev: Vec<Rc<RefCell<Value>>>,
}
impl Value {
pub fn new(data: f64) -> Self {
Value {
data,
grad: 0.0,
backward: None,
prev: Vec::new(),
}
}
}
💡 Why Rc<RefCell<Value>>?
A Value can be an input to multiple downstream operations —
it has multiple "parents." Rust's ownership model doesn't allow shared
ownership by default. Rc (Reference Counted) enables shared
ownership. RefCell allows interior mutability — we need to
modify the grad field even through a shared reference. This
is the educational pain the syllabus warned about: the
borrow checker fights graph structures, and you'll learn exactly why.
Now we define operations on Value. Each operation:
Valueprevbackward closure that knows how to propagate gradients through this specific operationimpl Value {
pub fn add(self: &Rc<RefCell<Self>>, other: &Rc<RefCell<Self>>) -> Rc<RefCell<Value>> {
let a = self.borrow().data;
let b = other.borrow().data;
let out = Rc::new(RefCell::new(Value::new(a + b)));
// Clone Rcs for the closure
let self_rc = Rc::clone(self);
let other_rc = Rc::clone(other);
let out_rc = Rc::downgrade(&out);
out.borrow_mut().backward = Some(Box::new(move || {
let out_grad = out_rc.upgrade().unwrap().borrow().grad;
self_rc.borrow_mut().grad += out_grad * 1.0; // d(a+b)/da = 1
other_rc.borrow_mut().grad += out_grad * 1.0; // d(a+b)/db = 1
}));
out.borrow_mut().prev = vec![Rc::clone(self), Rc::clone(other)];
out
}
pub fn mul(self: &Rc<RefCell<Self>>, other: &Rc<RefCell<Self>>) -> Rc<RefCell<Value>> {
let a = self.borrow().data;
let b = other.borrow().data;
let out = Rc::new(RefCell::new(Value::new(a * b)));
let self_rc = Rc::clone(self);
let other_rc = Rc::clone(other);
let out_rc = Rc::downgrade(&out);
out.borrow_mut().backward = Some(Box::new(move || {
let out_grad = out_rc.upgrade().unwrap().borrow().grad;
let a = self_rc.borrow().data;
let b = other_rc.borrow().data;
self_rc.borrow_mut().grad += out_grad * b; // d(a*b)/da = b
other_rc.borrow_mut().grad += out_grad * a; // d(a*b)/db = a
}));
out.borrow_mut().prev = vec![Rc::clone(self), Rc::clone(other)];
out
}
pub fn relu(self: &Rc<RefCell<Self>>) -> Rc<RefCell<Value>> {
let x = self.borrow().data;
let out_data = if x > 0.0 { x } else { 0.0 };
let out = Rc::new(RefCell::new(Value::new(out_data)));
let self_rc = Rc::clone(self);
let out_rc = Rc::downgrade(&out);
out.borrow_mut().backward = Some(Box::new(move || {
let out_grad = out_rc.upgrade().unwrap().borrow().grad;
let x = self_rc.borrow().data;
let relu_grad = if x > 0.0 { 1.0 } else { 0.0 };
self_rc.borrow_mut().grad += out_grad * relu_grad;
}));
out.borrow_mut().prev = vec![Rc::clone(self)];
out
}
}
Look at those backward closures. Each one says: "the gradient
of this operation with respect to each input equals the output gradient
multiplied by the local derivative." That's the chain rule
— but encoded imperatively, not as a formula.
With the graph built during the forward pass, computing all gradients is a three-step process:
1.0 (the gradient of the loss with respect to itself).backward closure, which adds gradients to its children.impl Value {
pub fn backward(self: &Rc<RefCell<Self>>) {
// Topological sort (DFS-based)
let mut order: Vec<Rc<RefCell<Value>>> = Vec::new();
let mut visited: Vec<*const RefCell<Value>> = Vec::new();
fn topo_sort(
node: &Rc<RefCell<Value>>,
order: &mut Vec<Rc<RefCell<Value>>>,
visited: &mut Vec<*const RefCell<Value>>,
) {
let ptr: *const RefCell<Value> = &**node;
if visited.contains(&ptr) {
return;
}
visited.push(ptr);
for child in &node.borrow().prev {
topo_sort(child, order, visited);
}
order.push(Rc::clone(node));
}
topo_sort(self, &mut order, &mut visited);
// Set output gradient
self.borrow_mut().grad = 1.0;
// Walk in reverse
for node in order.iter().rev() {
if let Some(ref backward_fn) = node.borrow().backward {
backward_fn();
}
}
}
}
The backward pass is just the forward pass in reverse, applying the chain rule at each node. That's why it's called backpropagation: errors propagate backward through the same graph.
Here's what our Chapter 1 sine-wave fitter looks like with the autograd engine:
use std::cell::RefCell;
use std::rc::Rc;
use tau_autograd::value::Value;
fn main() {
// Create parameters as Value nodes
let w1 = Rc::new(RefCell::new(Value::new(0.5)));
let b = Rc::new(RefCell::new(Value::new(0.1)));
// Input and target
let x = Rc::new(RefCell::new(Value::new(2.0)));
let y_target = Rc::new(RefCell::new(Value::new(1.8)));
// Forward pass: y_pred = w1 * x + b
let w1x = w1.mul(&x); // w1 * x
let y_pred = w1x.add(&b); // w1*x + b
// Loss: (y_pred - y_target)^2
let diff = y_pred.add(&y_target.mul(&Rc::new(RefCell::new(Value::new(-1.0)))));
let loss = diff.mul(&diff); // square the difference
// Backward pass
loss.borrow_mut().grad = 1.0;
loss.backward();
println!("w1 grad: {}", w1.borrow().grad);
// Prints the gradient of loss with respect to w1 —
// computed automatically through the graph.
}
No manual gradient derivation. No per-parameter update code. Just build the
graph, call backward(), and read the gradients. You can change
the network structure (add layers, swap activations) and the autograd engine
handles it automatically.
This is the hardest part of the chapter, and the syllabus calls it out explicitly: the borrow checker will fight you on the graph structure. Here's why:
Rc<RefCell<T>>, but this adds runtime overhead and complexity. The Weak references (via Rc::downgrade) in the backward closures prevent reference cycles.⚠️ Educational pain — intentionally
This is the only chapter where we fight the borrow checker this hard. By
the end, you understand exactly why ML frameworks are written in C++ with
Python bindings (Python's GC handles the graph, C++ handles the math). In
Rust, we can build a working autograd engine — it's just not idiomatic Rust.
Real Rust ML libraries like candle and burn use
different approaches (tensor-level ops, no per-value Rc overhead).
If you get stuck, remember: the borrow checker errors are telling you exactly what the graph structure needs. Each error teaches you something about ownership that you'll carry into every other Rust project. This is educational pain by design — the syllabus put it here deliberately.
Once you've built an autograd engine:
ndarray and higher-level tools — but you'll know exactly what they do under the hood.📐 What we built vs. production autograd
Our engine tracks every scalar operation individually. PyTorch's autograd works at the tensor level — one operation on a 1024×1024 matrix is recorded as a single graph node, not 1 million scalar nodes. The principle is identical; only the granularity differs.
Value struct wraps a number with its gradient, a backward closure, and child pointers. This is the fundamental autograd primitive.
add, mul, relu) creates a graph node and stores a local gradient rule in its backward closure.