We will go from illustrating differentiation and finding derivatives in Python all the way to implementing the backpropagation algorithm. Even if the idea of derivatives is mathematically familiar, it is still worth seeing its various visualizations through code.
Differentiation¶
The code below shows the function $f(x) = x^2$ in action. Our goal is to see how much the output changes as we modify the input by some value $h$. Feel free to try other values and functions as well.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def f(x):
return x**2
x = 3.0
for h in [10, 1, 0.1, 0]:
print(f"If we shift input by {h}, output becomes {f(x+h)}")
If we shift input by 10, output becomes 169.0
If we shift input by 1, output becomes 16.0
If we shift input by 0.1, output becomes 9.610000000000001
If we shift input by 0, output becomes 9.0
We can code and visualize the well-known equation for the change, $\Delta y = f(x + \Delta x) - f(x)$, for various cases as follows. Again, feel free to modify the values and functions to observe how the change behaves.
h = 1.0
dx = h
dy = f(x+h) - f(x)
print(f"Δx: {dx}")
print(f"Δy: {dy}")
print(f"When you change x by {dx} unit, y changes by {dy} units.")
Δx: 1.0
Δy: 7.0
When you change x by 1.0 unit, y changes by 7.0 units.
def plot_delta(x, h, start=-4, stop=4, num=30):
# `np.linspace` returns an array of num inputs within a range.
x_all = np.linspace(start, stop, num)
y_all = f(x_all)
plt.figure(figsize=(4, 4))
plt.plot(x_all, y_all)
    # draw Δx (horizontal) and Δy (vertical) as red segments
    plt.plot([x, x + h], [f(x), f(x)], color='r')
    plt.plot([x + h, x + h], [f(x), f(x + h)], color='r')
plot_delta(x=2, h=1)
How do we find whether the output changes significantly when we change the input by some amount $h$? We should be familiar with the rate-of-change ratio when we choose a small step size $h$:
$$ \dfrac{\Delta y}{\Delta x} = \dfrac{f(x + h) - f(x)}{h}.$$
def plot_roc(x, h):
dx = h
dy = f(x + h) - f(x)
plot_delta(x, h)
print(f"Rate of change is {dy / dx}")
plot_roc(3, 1)
Rate of change is 7.0
plot_roc(3, 0.5)
Rate of change is 6.5
plot_roc(1, 1)
Rate of change is 3.0
plot_roc(-2, 0.5)
Rate of change is -3.5
The rates of change for different values of $h$ differ even at the same point $x$. We would like to come up with a single value that tells us how significantly $y$ changes at a given point $x$ of the function:
$$ L = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} $$
Simply, this limit tells us how much the value of the output will change when we change the input by just a very small amount.
Note
Essentially, the derivative is a function: it assigns to each input $x$ the rate at which the original function changes at that point. That is, for each input, the limit above produces a value, and as $x$ changes, this value changes as well.
x = 3
h = 0.000001 # a small h approximating the limit h → 0
d = (f(x + h) - f(x)) / h
f"The value of derivative function is {d}"
'The value of derivative function is 6.000001000927568'
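To emphasize that the derivative is itself a function of $x$, the small sketch below (the helper name numeric_derivative is our own) evaluates the same expression at several points and compares it with the analytic derivative $2x$:
def numeric_derivative(g, x, h=1e-6):
    # finite-difference approximation of the derivative of g at x
    return (g(x + h) - g(x)) / h
for xi in [-2.0, 0.0, 1.0, 3.0]:
    print(f"x={xi}: numeric {numeric_derivative(lambda v: v**2, xi):.4f} vs analytic {2*xi}")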
Partial Derivatives¶
When we extend this idea to multivariable calculus, we use partial derivatives. A partial derivative with respect to (wrt) a given variable measures how much the output changes when we nudge that variable alone by a very small amount, while keeping all other variables fixed. Below we will see how it behaves for addition, $f(x, y)=x+y$.
f = lambda x, y: x + y
x = 2
y = 3
f(x, y)
5
Partial derivatives w.r.t $x$ and $y$ will be as follows:
h = 0.000001
(f(x + h, y) - f(x, y)) / h
1.000000000139778
h = 0.000001
(f(x, y+h) - f(x, y)) / h
1.000000000139778
No matter what the input values are, the expression always approaches $1.0$ for addition, because increasing one addend by $h$ increases the sum by exactly $h$.
for x, y in zip([-20, 2, 3], [300, 75, 10]):
print(f'x={x}, y={y}: {(f(x + h, y) - f(x, y)) / h}')
x=-20, y=300: 0.9999999974752427
x=2, y=75: 0.9999999974752427
x=3, y=10: 1.0000000010279564
Indeed, with a simple addition $x + y$, increasing $x$ or $y$ by some amount increases the result by exactly the same amount. The assertions below hold for any value of $h$.
h = 10
assert f(x+h, y) - f(x, y) == h
assert f(x, y+h) - f(x, y) == h
Let's now see the case for multiplication $f(x, y)=x * y$.
f = lambda x, y: x * y
x = 2
y = 3
h = 1e-5 # scientific notation for 0.00001
(f(x + h, y) - f(x, y)) / h # wrt x
3.000000000064062
for x in [-20, 2, 3]:
print(f'x={x}, y={y}: {(f(x + h, y) - f(x, y)) / h}')
x=-20, y=3: 2.999999999531155
x=2, y=3: 3.000000000064062
x=3, y=3: 3.000000000064062
x = 10
h = 5
pdx = (f(x+h, y) - f(x, y)) / h
print(pdx, y)
assert round(pdx, 2) == round(y, 2)
3.0 3
Finally, we will consider a complex function with three variables $f(x, y, z)=x^2+y^3-z$.
def f(x, y, z):
return x**2 + y**3 - z
x = 2
y = 3
z = 4
f(x, y, z)
27
h = 1
f(x + h, y, z)
32
f(x + h, y, z) - f(x, y, z)
5
(f(x + h, y, z) - f(x, y, z)) / h
5.0
h = 0.00001 # back to a small h approximating the limit
pdx = (f(x + h, y, z) - f(x, y, z)) / h
pdx
4.000010000027032
assert 2*x == round(pdx)
Partial derivative w.r.t $x$ is $2x$ (by the power rule), and for $x=2$ indeed we get $4$.
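As an optional cross-check (assuming the sympy library is installed; it is not required anywhere else in this notebook), symbolic differentiation gives the same result:
import sympy as sp
xs, ys, zs = sp.symbols('x y z')
sp.diff(xs**2 + ys**3 - zs, xs)  # 2*x, matching the power rule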
Exercise
Code the partial derivatives w.r.t $y$ and $z$ and verify that the results are correct.
Computation Graph¶
Info
The following source was used in preparing this material: Andrej Karpathy's lecture on Micrograd.
Micrograd is an educational library that demonstrates the core ideas behind automatic differentiation. The idea behind it is the same one used in major DL frameworks. Micrograd closely mirrors the logic of the PyTorch autograd engine. In this and the following notebooks, we will modify micrograd further to improve correspondence with PyTorch and make the explanations clearer.
Tip
It is recommended to run this notebook on Google Colab, where Graphviz is available by default. For local development, graph visualization requires both the Python wrapper (pip install graphviz) and the Graphviz system executable to be installed and available on the system PATH.
# This is a graph visualization code from micrograd, no need to understand the details
# https://github.com/karpathy/micrograd/blob/master/trace_graph.ipynb
from graphviz import Digraph
def trace(root):
nodes, edges = set(), set()
def build(v):
if v not in nodes:
nodes.add(v)
for child in v._prev:
edges.add((child, v))
build(child)
build(root)
return nodes, edges
def draw_dot(root, format='svg', rankdir='LR'):
"""
format: png | svg | ...
rankdir: TB (top to bottom graph) | LR (left to right)
"""
assert rankdir in ['LR', 'TB']
nodes, edges = trace(root)
dot = Digraph(format=format, graph_attr={'rankdir': rankdir}) #, node_attr={'rankdir': 'TB'})
for n in nodes:
dot.node(name=str(id(n)), label = "{ %s | data %.3f | grad %.3f }" % (n.label, n.data, n.grad), shape='record')
if n._op:
dot.node(name=str(id(n)) + n._op, label=n._op)
dot.edge(str(id(n)) + n._op, str(id(n)))
for n1, n2 in edges:
dot.edge(str(id(n1)), str(id(n2)) + n2._op)
return dot
The purpose of the Value class below is to store both numerical values and the history of operations that produced them. Each instance represents a node in a computation graph. When a Value object is created, it stores:
- the numerical result of a computation (data),
- references to the input values that produced it (_prev),
- the operation that combined those inputs (_op),
- a label provided by us so that we can see it on the computation graph (label),
- the gradient of the final output w.r.t this value (grad).
Arithmetic operations such as addition and multiplication are overloaded so that, instead of returning plain numbers, they return new Value objects. These new objects remember which values were combined and how they were combined. As computations are chained together, this process builds a directed graph, later used to propagate gradients.
# Value class stores a number and "remembers" information about its origins
class Value:
def __init__(self, data, _prev=(), _op='', label=''):
self.data = data
self._prev = _prev
self._op = _op
self.label = label
self.grad = 0
def __add__(self, other):
data = self.data + other.data
out = Value(data, (self, other), '+')
return out
def __mul__(self, other):
data = self.data * other.data
out = Value(data, (self, other), "*")
return out
def __repr__(self):
return f"Value(data={self.data}, grad={self.grad})"
Below is an example of a simple expression built with the help of the Value class: $L=c*d$ where $c=a+b$.
a = Value(5, label='a')
b = Value(3, label='b')
c = a + b; c.label = 'c'
d = Value(10, label='d')
L = c * d; L.label = 'L'
print(a, a._prev)
print(L, L._prev)
Value(data=5, grad=0) ()
Value(data=80, grad=0) (Value(data=8, grad=0), Value(data=10, grad=0))
We can use the helper function built with graphviz above to plot our computation graph as a nice visualization.
draw_dot(c)
draw_dot(L)
Gradient¶
The gradient is a vector of partial derivatives. We want to know how much changing each variable affects the output $L$ (the loss). We will store those partial derivatives in the grad attribute of each Value object. We start by noting that the derivative of a variable with respect to itself is $1$ ($dx/dx = 1$), which we verify numerically below.
L.grad = 1.0
f = lambda x: x
h = 1e-5
pdx = (f(x + h) - f(x)) / h
assert round(pdx) == 1
Now let's see how changing other variables will affect the eventual result.
# extended version of the function we saw previously
def f(ha=0, hb=0, hc=0, hd=0):
a = Value(5 + ha, label='a')
b = Value(3 + hb, label='b')
c = a + b + Value(hc); c.label = 'c'
d = Value(10 + hd, label='d')
L = c * d; L.label = 'L'
return L.data
h = 1e-5
(f(hd=h) - f()) / h
7.999999999697137
From the computation graph we also know that $L=c*d$. When we change the value of $d$ just a little bit (the derivative of $L$ wrt $d$), the value of $L$ changes at a rate equal to $c$, which is $8.0$. We saw this above in the partial derivative of a multiplication.
d.grad = c.data
c.grad = d.data
With the same logic, the derivative of $L$ wrt $c$ will be the value of $d$, which is $10.0$ in our specific case. We can verify it.
(f(hc=h) - f()) / h
10.000000000331966
Chain Rule¶
To determine how much changing earlier variables in the computation graph will affect the loss, we can apply the chain rule. Simply, the derivative of $L$ wrt $a$ is the derivative of $c$ wrt $a$ multiplied by the derivative of $L$ wrt $c$:
$$ \frac{dL}{da} = \frac{dL}{dc} \cdot \frac{dc}{da}. $$
The derivative of $c$ wrt both $a$ and $b$ is $1.0$ due to the property of addition we saw previously. From here:
a.grad = 1.0 * c.grad
b.grad = 1.0 * c.grad
a.grad, b.grad
(10.0, 10.0)
We can verify this as well. Let's see how much $L$ is affected when we shift $a$ or $b$ by a very small amount.
(f(ha=h) - f()) / h
10.000000000331966
(f(hb=h) - f()) / h
10.000000000331966
We will finally redraw the manually updated computation graph.
draw_dot(L)
It basically implies that, for example, changing the value of $a$ by $1.0$ unit (from $5$ to $6$) will increase the value of $L$ by $10$ units (from $80$ to $90$).
f(ha=1), f(hb=1), f(hc=1), f(hd=1)
(90, 90, 90, 88)
Gradient Descent¶
What we saw above was one backward pass done manually. We are mainly interested in the signs of partial derivatives to know if they are positively or negatively influencing the eventual loss $L$ of our model. In our case, all the derivatives are positive and influence loss positively. We have to simply nudge the values in the opposite direction of the gradient to bring the loss down. This is known as gradient descent.
$$ \theta \leftarrow \theta - \eta \, \nabla_\theta L $$
Here, $ \theta $ denotes a model parameter, $ \nabla_\theta L $ is the gradient of the loss wrt that parameter, and $ \eta $ is the learning rate. The learning rate controls how large each update step is during gradient descent. If the learning rate is too small, the parameters change very slowly and training takes a long time. If the learning rate is too large, the updates can overshoot the minimum and cause the loss to increase or oscillate. We will discuss learning rate in a greater detail in the future when discussing optimization.
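As a toy illustration of this trade-off (our own sketch, separate from the running example), consider gradient descent on $f(x)=x^2$, whose analytic gradient is $2x$, with three different learning rates:
# minimize f(x) = x^2 starting from x = 3, using the gradient 2x
for lr_demo in [0.01, 0.9, 1.1]:
    xv = 3.0
    for _ in range(10):
        xv -= lr_demo * 2 * xv
    print(f"lr={lr_demo}: x after 10 steps = {xv:.5f}")
A tiny rate barely moves $x$ toward the minimum, a moderately large one overshoots and oscillates yet still converges, and a too-large one diverges.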
lr = 0.01
a.data -= lr * a.grad
b.data -= lr * b.grad
d.data -= lr * d.grad
# we skip c which is controlled by the values of a and b
# pay attention that the rest are leaf nodes in the computation graph
Note
If the loss is a negative value (not common), we would need to gradient-ascend the loss upwards towards zero by changing the update sign from -= to +=. Note that the parameter values (a, b, d) themselves can decrease or increase depending on the sign of grad.
Forward Pass¶
We will now do a single forward pass to see if the loss has decreased. Recall that the previous loss was 80.
c = a + b
L = c * d
L.data
77.376
It seems like we optimized our values and brought down the loss.
Backward Pass¶
Manually calculating gradients is good only for educational purposes. We should implement an automatic backward pass which calculates the gradients for us. To support this, we will rewrite our Value class to store a _backward function. This function applies the local gradient associated with the operation that produced the value. Initially, _backward does nothing.
For addition, the local derivatives are: $ \frac{\partial c}{\partial a} = 1 $ and $ \frac{\partial c}{\partial b} = 1 $. Hence, the gradient flowing into $c$ (stored in out.grad) is passed unchanged to both inputs:
a.grad += out.grad
b.grad += out.grad
This is why, in the _backward function for addition, both operands receive the same gradient contribution.
For multiplication, the local derivatives are: $ \frac{\partial c}{\partial a} = b $ and $ \frac{\partial c}{\partial b} = a $. During backpropagation, the incoming gradient is scaled by the opposite operand:
a.grad += b.data * out.grad
b.grad += a.data * out.grad
This reflects the chain rule: each variable’s gradient depends on how the output changes with respect to that variable. By storing a _backward function at every node and calling these functions in reverse topological order, the full gradient of the loss with respect to all intermediate values can be computed automatically.
class Value:
def __init__(self, data, _prev=(), _op='', label=''):
self.data = data
self._prev = _prev
self._op = _op
self.label = label
self.grad = 0.0
self._backward = lambda: None # initially it is a function which does nothing
def __add__(self, other):
data = self.data + other.data
out = Value(data, (self, other), '+')
        def _backward():
            # accumulate with += so a value used more than once sums its contributions
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
out._backward = _backward
return out
def __mul__(self, other):
data = self.data * other.data
out = Value(data, (self, other), "*")
        def _backward():
            # chain rule: scale the incoming gradient by the opposite operand
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
out._backward = _backward
return out
def __repr__(self):
return f"Value(data={self.data}, grad={self.grad})"
# Recreating the same function
a = Value(5, label='a')
b = Value(3, label='b')
c = a + b; c.label = 'c'
d = Value(10, label='d')
L = c * d; L.label = 'L'
draw_dot(L)
We should not forget to initialize the gradient of the loss to $1.0$ and then call the _backward functions from the output backwards. With a correct implementation, we get the same results we calculated manually before.
L.grad = 1.0
L._backward()
c._backward()
draw_dot(L)
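Calling each node's _backward by hand quickly becomes impractical for larger graphs. As a preview of the automated version described earlier, here is a minimal sketch (a standalone helper of our own, mirroring micrograd's approach) that builds the topological order and runs the _backward functions in reverse:
def backward(root):
    # collect nodes so that every node appears after all of its inputs
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)
    build(root)
    root.grad = 1.0  # derivative of the output w.r.t itself
    for node in reversed(topo):  # from the output back to the inputs
        node._backward()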
Exercise
Extend the Value class so that other operations and their partial derivatives can be handled as well (e.g. division, power).
Model Training with Backpropagation¶
We can now train the model by repeatedly performing a forward pass, a backward pass, and an optimization (gradient descent) step to reduce the loss. The backward pass, which computes gradients by propagating them through the whole computation graph, is called backpropagation.
# optimization
lr = 0.05
a.data -= lr * a.grad
b.data -= lr * b.grad
d.data -= lr * d.grad
# forward pass
c = a + b
L = c * d
# backward pass (zero the grads first, since _backward accumulates with +=)
a.grad = b.grad = d.grad = 0.0
L.grad = 1.0
L._backward()
c._backward()
L.data # loss
67.2
We have now trained the model for a single epoch. Even though what we do here is oversimplified, the main intuition and concepts behind training a neural network are the same. We will now train the model for multiple epochs until the loss drops below zero.
while True:
# optimization
a.data -= lr * a.grad
b.data -= lr * b.grad
d.data -= lr * d.grad
# forward pass
c = a + b
L = c * d
    # backward pass (zero the grads first, since _backward accumulates with +=)
    a.grad = b.grad = d.grad = 0.0
    L.grad = 1.0
L._backward()
c._backward()
if L.data < 0:
break
print(f'Loss: {round(L.data,2)}')
Loss: 55.87
Loss: 45.77
Loss: 36.68
Loss: 28.42
Loss: 20.81
Loss: 13.69
Loss: 6.91
Loss: 0.34
PyTorch Implementation¶
Tip
PyTorch is preinstalled and configured for Google Colab. For local development, PyTorch can be installed via pip install torch. If you plan to use a GPU, make sure the installed PyTorch version matches your CUDA setup.
As mentioned in the beginning, our manual implementation mirrors what is built into PyTorch. We will do a forward and backward pass with the help of the autograd engine and check that the gradients match what we previously calculated. Since gradients do not need to be computed for every tensor, PyTorch sets requires_grad to False by default to save computation, so we need to enable it for the tensors whose gradients we want.
import torch
a = torch.tensor(5.0); a.requires_grad = True
b = torch.tensor(3.0); b.requires_grad = True
c = a + b
d = torch.tensor(10.0); d.requires_grad = True
L = c * d
L.backward()
a.grad, b.grad, d.grad
(tensor(10.), tensor(10.), tensor(8.))
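Note that c is an intermediate (non-leaf) tensor, so PyTorch does not retain its gradient by default. If we also wanted c.grad, a sketch would call retain_grad() before the backward pass:
a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
d = torch.tensor(10.0, requires_grad=True)
c = a + b
c.retain_grad()  # keep the gradient of this intermediate tensor
L = c * d
L.backward()
c.grad  # tensor(10.), matching our manual calculation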
This notebook introduced the core ideas behind gradient-based learning. It began with differentiation and extended the concept to partial derivatives for multivariable functions. Computation graphs were then introduced to represent how values depend on one another and to make gradient flow explicit. Using this structure, the gradient and the chain rule were used to explain how gradients were propagated backward from the loss to earlier variables. Gradient descent was presented as the mechanism for updating parameters in the direction that reduces the loss. The roles of the forward pass and backward pass were clarified and combined to describe model training with backpropagation. Finally, these ideas were demonstrated through a PyTorch implementation using automatic differentiation. In the next notebook, a neural network will be trained using a more advanced training engine.