Chapter 2

The Computational Graph — Tracking What Affects What

In Chapter 1, we computed gradients by hand for 97 parameters. That was tedious and error-prone. This chapter builds an autograd engine that tracks which operations affect which values, then automatically computes gradients by walking the graph backward. Backpropagation is bookkeeping, not calculus — once you've built this, you'll never be intimidated by it again.

The Problem with Manual Gradients

In Chapter 1, we computed gradients for each of 97 parameters by hand. For each parameter, we had to derive its contribution to the loss and write the update code manually. This works for 97 parameters. It does not work for 1.8 trillion (GPT-4).

Worse, every time we change the network architecture — add a layer, change the activation function — we have to re-derive and re-write all the gradient equations. This is a productivity disaster.

⚠️ The key insight

Every computation in your network is a combination of simple operations: addition, multiplication, ReLU, etc. Each operation knows how to compute its own local gradient. If we record the sequence of operations (the computational graph), we can propagate gradients backward automatically — regardless of the network structure.

The `Value` Struct

The core idea: instead of plain f64 values, we wrap every number in a Value that tracks:

1.data — the actual number (the forward value)
2.grad — the gradient of the loss with respect to this value
3.backward — a closure that computes the local gradient and propagates it to prev
4.prev — the child Values that produced this one (the graph edges)

tau-autograd/src/value.rs

use std::cell::RefCell;
use std::rc::Rc;

pub struct Value {
    pub data: f64,
    pub grad: f64,
    pub backward: Option<Box<dyn Fn()>>,
    pub prev: Vec<Rc<RefCell<Value>>>,
}

impl Value {
    pub fn new(data: f64) -> Self {
        Value {
            data,
            grad: 0.0,
            backward: None,
            prev: Vec::new(),
        }
    }
}

💡 Why Rc<RefCell<Value>>?

A Value can be an input to multiple downstream operations — it has multiple "parents." Rust's ownership model doesn't allow shared ownership by default. Rc (Reference Counted) enables shared ownership. RefCell allows interior mutability — we need to modify the grad field even through a shared reference. This is the educational pain the syllabus warned about: the borrow checker fights graph structures, and you'll learn exactly why.

Adding Operations

Now we define operations on Value. Each operation:

1.Creates a new output Value
2.Records the inputs as prev
3.Stores a backward closure that knows how to propagate gradients through this specific operation

tau-autograd/src/value.rs — operations

impl Value {
    pub fn add(self: &Rc<RefCell<Self>>, other: &Rc<RefCell<Self>>) -> Rc<RefCell<Value>> {
        let a = self.borrow().data;
        let b = other.borrow().data;

        let out = Rc::new(RefCell::new(Value::new(a + b)));

        // Clone Rcs for the closure
        let self_rc = Rc::clone(self);
        let other_rc = Rc::clone(other);
        let out_rc = Rc::downgrade(&out);

        out.borrow_mut().backward = Some(Box::new(move || {
            let out_grad = out_rc.upgrade().unwrap().borrow().grad;
            self_rc.borrow_mut().grad += out_grad * 1.0; // d(a+b)/da = 1
            other_rc.borrow_mut().grad += out_grad * 1.0; // d(a+b)/db = 1
        }));

        out.borrow_mut().prev = vec![Rc::clone(self), Rc::clone(other)];
        out
    }

    pub fn mul(self: &Rc<RefCell<Self>>, other: &Rc<RefCell<Self>>) -> Rc<RefCell<Value>> {
        let a = self.borrow().data;
        let b = other.borrow().data;

        let out = Rc::new(RefCell::new(Value::new(a * b)));

        let self_rc = Rc::clone(self);
        let other_rc = Rc::clone(other);
        let out_rc = Rc::downgrade(&out);

        out.borrow_mut().backward = Some(Box::new(move || {
            let out_grad = out_rc.upgrade().unwrap().borrow().grad;
            let a = self_rc.borrow().data;
            let b = other_rc.borrow().data;
            self_rc.borrow_mut().grad += out_grad * b; // d(a*b)/da = b
            other_rc.borrow_mut().grad += out_grad * a; // d(a*b)/db = a
        }));

        out.borrow_mut().prev = vec![Rc::clone(self), Rc::clone(other)];
        out
    }

    pub fn relu(self: &Rc<RefCell<Self>>) -> Rc<RefCell<Value>> {
        let x = self.borrow().data;
        let out_data = if x > 0.0 { x } else { 0.0 };

        let out = Rc::new(RefCell::new(Value::new(out_data)));

        let self_rc = Rc::clone(self);
        let out_rc = Rc::downgrade(&out);

        out.borrow_mut().backward = Some(Box::new(move || {
            let out_grad = out_rc.upgrade().unwrap().borrow().grad;
            let x = self_rc.borrow().data;
            let relu_grad = if x > 0.0 { 1.0 } else { 0.0 };
            self_rc.borrow_mut().grad += out_grad * relu_grad;
        }));

        out.borrow_mut().prev = vec![Rc::clone(self)];
        out
    }
}

Look at those backward closures. Each one says: "the gradient of this operation with respect to each input equals the output gradient multiplied by the local derivative." That's the chain rule — but encoded imperatively, not as a formula.

The Backward Pass — Walking the Graph in Reverse

With the graph built during the forward pass, computing all gradients is a three-step process:

1.Start at the output. Set its gradient to 1.0 (the gradient of the loss with respect to itself).
2.Walk the graph in topological order (children before parents).
3.Call each node's backward closure, which adds gradients to its children.

tau-autograd/src/value.rs — backward pass

impl Value {
    pub fn backward(self: &Rc<RefCell<Self>>) {
        // Topological sort (DFS-based)
        let mut order: Vec<Rc<RefCell<Value>>> = Vec::new();
        let mut visited: Vec<*const RefCell<Value>> = Vec::new();

        fn topo_sort(
            node: &Rc<RefCell<Value>>,
            order: &mut Vec<Rc<RefCell<Value>>>,
            visited: &mut Vec<*const RefCell<Value>>,
        ) {
            let ptr: *const RefCell<Value> = &**node;
            if visited.contains(&ptr) {
                return;
            }
            visited.push(ptr);
            for child in &node.borrow().prev {
                topo_sort(child, order, visited);
            }
            order.push(Rc::clone(node));
        }

        topo_sort(self, &mut order, &mut visited);

        // Set output gradient
        self.borrow_mut().grad = 1.0;

        // Walk in reverse
        for node in order.iter().rev() {
            if let Some(ref backward_fn) = node.borrow().backward {
                backward_fn();
            }
        }
    }
}

The backward pass is just the forward pass in reverse, applying the chain rule at each node. That's why it's called backpropagation: errors propagate backward through the same graph.

Using the Autograd Engine

Here's what our Chapter 1 sine-wave fitter looks like with the autograd engine:

use std::cell::RefCell;
use std::rc::Rc;
use tau_autograd::value::Value;

fn main() {
    // Create parameters as Value nodes
    let w1 = Rc::new(RefCell::new(Value::new(0.5)));
    let b = Rc::new(RefCell::new(Value::new(0.1)));

    // Input and target
    let x = Rc::new(RefCell::new(Value::new(2.0)));
    let y_target = Rc::new(RefCell::new(Value::new(1.8)));

    // Forward pass: y_pred = w1 * x + b
    let w1x = w1.mul(&x); // w1 * x
    let y_pred = w1x.add(&b); // w1*x + b

    // Loss: (y_pred - y_target)^2
    let diff = y_pred.add(&y_target.mul(&Rc::new(RefCell::new(Value::new(-1.0)))));
    let loss = diff.mul(&diff); // square the difference

    // Backward pass
    loss.borrow_mut().grad = 1.0;
    loss.backward();

    println!("w1 grad: {}", w1.borrow().grad);
    // Prints the gradient of loss with respect to w1 —
    // computed automatically through the graph.
}

No manual gradient derivation. No per-parameter update code. Just build the graph, call backward(), and read the gradients. You can change the network structure (add layers, swap activations) and the autograd engine handles it automatically.

The Borrow Checker Will Fight You

This is the hardest part of the chapter, and the syllabus calls it out explicitly: the borrow checker will fight you on the graph structure. Here's why:

→A computational graph is a DAG (directed acyclic graph). Each node can have multiple children (fan-out) and multiple parents (fan-in).
→Rust's ownership model assumes single ownership — every value has exactly one owner. A graph node violates this by definition.
→We work around it with Rc<RefCell<T>>, but this adds runtime overhead and complexity. The Weak references (via Rc::downgrade) in the backward closures prevent reference cycles.

⚠️ Educational pain — intentionally

This is the only chapter where we fight the borrow checker this hard. By the end, you understand exactly why ML frameworks are written in C++ with Python bindings (Python's GC handles the graph, C++ handles the math). In Rust, we can build a working autograd engine — it's just not idiomatic Rust. Real Rust ML libraries like candle and burn use different approaches (tensor-level ops, no per-value Rc overhead).

If you get stuck, remember: the borrow checker errors are telling you exactly what the graph structure needs. Each error teaches you something about ownership that you'll carry into every other Rust project. This is educational pain by design — the syllabus put it here deliberately.

Why This Matters

Once you've built an autograd engine:

→Backpropagation is demystified forever. It's not magic — it's a topological sort plus the chain rule, both of which you just implemented.
→Every neural network framework (PyTorch, TensorFlow, JAX) has an autograd engine at its core. You now understand what they do under the hood.
→You can experiment freely. Want to try a new activation function? Just add an operation. The gradients are handled automatically.
→This is the last "from scratch" chapter. From here on, we use ndarray and higher-level tools — but you'll know exactly what they do under the hood.

📐 What we built vs. production autograd

Our engine tracks every scalar operation individually. PyTorch's autograd works at the tensor level — one operation on a 1024×1024 matrix is recorded as a single graph node, not 1 million scalar nodes. The principle is identical; only the granularity differs.

Key Takeaways

→ Backpropagation is bookkeeping — record the operations, then walk the graph in reverse applying the chain rule.
→ The Value struct wraps a number with its gradient, a backward closure, and child pointers. This is the fundamental autograd primitive.
→ Each operation (add, mul, relu) creates a graph node and stores a local gradient rule in its backward closure.
→ Rc<RefCell> is the escape hatch for graph structures in Rust — shared ownership with interior mutability. It works, but it has overhead.
→ After this chapter, you will never be intimidated by gradient descent again. You built it from the metal up.