We will go from illustrating differentiation and finding derivatives in Python all the way to implementing the backpropagation algorithm. Even if the idea of derivatives is mathematically familiar, it is still worth seeing its various visualizations through code.
Differentiation¶
The code below shows the function $f(x) = x^2$ in action. Our goal is to see how much the output changes as we modify the input by some value $h$. Feel free to try other values and functions as well.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def f(x):
return x**2
x = 3.0
for h in [10, 1, 0.1, 0]:
print(f"If we shift input by {h}, output becomes {f(x+h)}")
If we shift input by 10, output becomes 169.0
If we shift input by 1, output becomes 16.0
If we shift input by 0.1, output becomes 9.610000000000001
If we shift input by 0, output becomes 9.0
We can code and visualize the well-known equation for the change, $\Delta y = f(x + \Delta x) - f(x)$, for various cases as follows. Again, feel free to modify the values and functions to observe how the change behaves.
h = 1.0
dx = h
dy = f(x+h) - f(x)
print(f"Δx: {dx}")
print(f"Δy: {dy}")
print(f"When you change x by {dx} unit, y changes by {dy} units.")
Δx: 1.0
Δy: 7.0
When you change x by 1.0 unit, y changes by 7.0 units.
def plot_delta(x, h, start=-4, stop=4, num=30):
# `np.linspace` returns an array of num inputs within a range.
x_all = np.linspace(start, stop, num)
y_all = f(x_all)
plt.figure(figsize=(4, 4))
plt.plot(x_all, y_all)
    # draw Δx (horizontal) and Δy (vertical) as red segments
    plt.plot([x, x + h], [f(x), f(x)], color='r')
    plt.plot([x + h, x + h], [f(x), f(x + h)], color='r')
plot_delta(x=2, h=1)
How do we find whether the output changes significantly when we change the input by some amount $h$? We should be familiar with the rate-of-change ratio when we choose a small step size $h$:
$$ \dfrac{\Delta y}{\Delta x} = \dfrac{f(x + h) - f(x)}{h}.$$
def plot_roc(x, h):
dx = h
dy = f(x + h) - f(x)
plot_delta(x, h)
print(f"Rate of change is {dy / dx}")
plot_roc(3, 1)
Rate of change is 7.0
plot_roc(3, 0.5)
Rate of change is 6.5
plot_roc(1, 1)
Rate of change is 3.0
plot_roc(-2, 0.5)
Rate of change is -3.5
The rates of change for different values of $h$ differ even at the same point $x$. We would like to come up with a single value that tells us how significantly $y$ changes at a given point $x$ of the function:
$$ L = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} $$
Simply, this limit tells us how much the value of the output will change when we change the input by just a very small amount.
Note
Essentially, the derivative is a function: it assigns to each input $x$ the rate at which the original function changes at that point. That is, for each input, the limit above produces a value, and as $x$ changes, this value changes as well.
x = 3
h = 0.000001 # a small h approximating the limit h → 0
d = (f(x + h) - f(x)) / h
f"The value of derivative function is {d}"
'The value of derivative function is 6.000001000927568'
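To emphasize that the derivative is itself a function of $x$, the small sketch below (the helper name numeric_derivative is our own) evaluates the same expression at several points and compares it with the analytic derivative $2x$:
def numeric_derivative(g, x, h=1e-6):
    # finite-difference approximation of the derivative of g at x
    return (g(x + h) - g(x)) / h
for xi in [-2.0, 0.0, 1.0, 3.0]:
    print(f"x={xi}: numeric {numeric_derivative(lambda v: v**2, xi):.4f} vs analytic {2*xi}")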
Partial Derivatives¶
When we extend this idea to multivariable calculus, we use partial derivatives. A partial derivative with respect to (wrt) a given variable measures how much the output changes when we nudge that variable alone by a very small amount, while keeping all other variables fixed. Below we will see how it behaves for addition, $f(x, y)=x+y$.
f = lambda x, y: x + y
x = 2
y = 3
f(x, y)
5
Partial derivatives w.r.t $x$ and $y$ will be as follows:
h = 0.000001
(f(x + h, y) - f(x, y)) / h
1.000000000139778
h = 0.000001
(f(x, y+h) - f(x, y)) / h
1.000000000139778
No matter what the input values are, the expression always approaches $1.0$ for addition, because increasing one addend by $h$ increases the sum by exactly $h$.
for x, y in zip([-20, 2, 3], [300, 75, 10]):
print(f'x={x}, y={y}: {(f(x + h, y) - f(x, y)) / h}')
x=-20, y=300: 0.9999999974752427
x=2, y=75: 0.9999999974752427
x=3, y=10: 1.0000000010279564
Indeed, with a simple addition $x + y$, increasing $x$ or $y$ by some amount increases the result by exactly the same amount. The assertions below hold for any value of $h$.
h = 10
assert f(x+h, y) - f(x, y) == h
assert f(x, y+h) - f(x, y) == h
Let's now see the case for multiplication $f(x, y)=x * y$.
f = lambda x, y: x * y
x = 2
y = 3
h = 1e-5 # scientific notation for 0.00001
(f(x + h, y) - f(x, y)) / h # wrt x
3.000000000064062
for x in [-20, 2, 3]:
print(f'x={x}, y={y}: {(f(x + h, y) - f(x, y)) / h}')
x=-20, y=3: 2.999999999531155
x=2, y=3: 3.000000000064062
x=3, y=3: 3.000000000064062
x = 10
h = 5
pdx = (f(x+h, y) - f(x, y)) / h
print(pdx, y)
assert round(pdx, 2) == round(y, 2)
3.0 3
Finally, we will consider a complex function with three variables $f(x, y, z)=x^2+y^3-z$.
def f(x, y, z):
return x**2 + y**3 - z
x = 2
y = 3
z = 4
f(x, y, z)
27
h = 1
f(x + h, y, z)
32
f(x + h, y, z) - f(x, y, z)
5
(f(x + h, y, z) - f(x, y, z)) / h
5.0
h = 0.00001 # back to a small h approximating the limit
pdx = (f(x + h, y, z) - f(x, y, z)) / h
pdx
4.000010000027032
assert 2*x == round(pdx)
Partial derivative w.r.t $x$ is $2x$ (by the power rule), and for $x=2$ indeed we get $4$.
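As an optional cross-check (assuming the sympy library is installed; it is not required anywhere else in this notebook), symbolic differentiation gives the same result:
import sympy as sp
xs, ys, zs = sp.symbols('x y z')
sp.diff(xs**2 + ys**3 - zs, xs)  # 2*x, matching the power rule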
Exercise
Code the partial derivatives w.r.t $y$ and $z$ and verify that the results are correct.
Computation Graph¶
Info
The following source was used in preparing this material: Andrej Karpathy's lecture on Micrograd.
Micrograd is an educational library that demonstrates the core ideas behind automatic differentiation. The idea behind it is the same one used in major DL frameworks. Micrograd closely mirrors the logic of the PyTorch autograd engine. In this and the following notebooks, we will modify micrograd further to improve correspondence with PyTorch and make the explanations clearer.
Tip
It is recommended to run this notebook on Google Colab, where Graphviz is available by default. For local development, graph visualization requires both the Python wrapper (pip install graphviz) and the Graphviz system executable to be installed and available on the system PATH.
# This is a graph visualization code from micrograd, no need to understand the details
# https://github.com/karpathy/micrograd/blob/master/trace_graph.ipynb
from graphviz import Digraph
def trace(root):
nodes, edges = set(), set()
def build(v):
if v not in nodes:
nodes.add(v)
for child in v._prev:
edges.add((child, v))
build(child)
build(root)
return nodes, edges
def draw_dot(root, format='svg', rankdir='LR'):
"""
format: png | svg | ...
rankdir: TB (top to bottom graph) | LR (left to right)
"""
assert rankdir in ['LR', 'TB']
nodes, edges = trace(root)
dot = Digraph(format=format, graph_attr={'rankdir': rankdir}) #, node_attr={'rankdir': 'TB'})
for n in nodes:
dot.node(name=str(id(n)), label = "{ %s | data %.3f | grad %.3f }" % (n.label, n.data, n.grad), shape='record')
if n._op:
dot.node(name=str(id(n)) + n._op, label=n._op)
dot.edge(str(id(n)) + n._op, str(id(n)))
for n1, n2 in edges:
dot.edge(str(id(n1)), str(id(n2)) + n2._op)
return dot
The purpose of the Value class below is to store both numerical values and the history of operations that produced them. Each instance represents a node in a computation graph. When a Value object is created, it stores:
- the numerical result of a computation (data),
- references to the input values that produced it (_prev),
- the operation that combined those inputs (_op),
- a label provided by us so that we can see it on the computation graph (label),
- the gradient of the final output w.r.t this value (grad).
Arithmetic operations such as addition and multiplication are overloaded so that, instead of returning plain numbers, they return new Value objects. These new objects remember which values were combined and how they were combined. As computations are chained together, this process builds a directed graph, later used to propagate gradients.
# Value class stores a number and "remembers" information about its origins
class Value:
def __init__(self, data, _prev=(), _op='', label=''):
self.data = data
self._prev = _prev
self._op = _op
self.label = label
self.grad = 0
def __add__(self, other):
data = self.data + other.data
out = Value(data, (self, other), '+')
return out
def __mul__(self, other):
data = self.data * other.data
out = Value(data, (self, other), "*")
return out
def __repr__(self):
return f"Value(data={self.data}, grad={self.grad})"
Below is an example of a simple expression built with the help of the Value class: $L=c*d$ where $c=a+b$.
a = Value(5, label='a')
b = Value(3, label='b')
c = a + b; c.label = 'c'
d = Value(10, label='d')
L = c * d; L.label = 'L'
print(a, a._prev)
print(L, L._prev)
Value(data=5, grad=0) ()
Value(data=80, grad=0) (Value(data=8, grad=0), Value(data=10, grad=0))
We can use the helper function built with graphviz above to plot our computation graph as a nice visualization.
draw_dot(c)
draw_dot(L)
Gradient¶
The gradient is a vector of partial derivatives. We want to know how much changing each variable affects the output $L$ (the loss). We will store those partial derivatives in the grad attribute of each Value object. We start by noting that the derivative of a variable with respect to itself is $1$ ($dx/dx = 1$), which we verify numerically below.
L.grad = 1.0
f = lambda x: x
h = 1e-5
pdx = (f(x + h) - f(x)) / h
assert round(pdx) == 1
Now let's see how changing other variables will affect the eventual result.
# extended version of the function we saw previously
def f(ha=0, hb=0, hc=0, hd=0):
a = Value(5 + ha, label='a')
b = Value(3 + hb, label='b')
c = a + b + Value(hc); c.label = 'c'
d = Value(10 + hd, label='d')
L = c * d; L.label = 'L'
return L.data
h = 1e-5
(f(hd=h) - f()) / h
7.999999999697137
From the computation graph we also know that $L=c*d$. When we change the value of $d$ just a little bit (the derivative of $L$ wrt $d$), the value of $L$ changes at a rate equal to $c$, which is $8.0$. We saw this above in the partial derivative of a multiplication.
d.grad = c.data
c.grad = d.data
With the same logic, the derivative of $L$ wrt $c$ will be the value of $d$, which is $10.0$ in our specific case. We can verify it.
(f(hc=h) - f()) / h
10.000000000331966
Chain Rule¶
To determine how much changing earlier variables in the computation graph will affect the loss, we can apply the chain rule. Simply, the derivative of $L$ wrt $a$ is the derivative of $c$ wrt $a$ multiplied by the derivative of $L$ wrt $c$:
$$ \frac{dL}{da} = \frac{dL}{dc} \cdot \frac{dc}{da}. $$
The derivative of $c$ wrt both $a$ and $b$ is $1.0$ due to the property of addition we saw previously. From here:
a.grad = 1.0 * c.grad
b.grad = 1.0 * c.grad
a.grad, b.grad
(10.0, 10.0)
We can verify this as well. Let's see how much $L$ is affected when we shift $a$ or $b$ by a very small amount.
(f(ha=h) - f()) / h
10.000000000331966
(f(hb=h) - f()) / h
10.000000000331966
We will finally redraw the manually updated computation graph.
draw_dot(L)
It basically implies that, for example, changing the value of $a$ by $1.0$ unit (from $5$ to $6$) will increase the value of $L$ by $10$ units (from $80$ to $90$).
f(ha=1), f(hb=1), f(hc=1), f(hd=1)
(90, 90, 90, 88)
Gradient Descent¶
What we saw above was one backward pass done manually. We are mainly interested in the signs of partial derivatives to know if they are positively or negatively influencing the eventual loss $L$ of our model. In our case, all the derivatives are positive and influence loss positively. We have to simply nudge the values in the opposite direction of the gradient to bring the loss down. This is known as gradient descent.
$$ \theta \leftarrow \theta - \eta \, \nabla_\theta L $$
Here, $ \theta $ denotes a model parameter, $ \nabla_\theta L $ is the gradient of the loss wrt that parameter, and $ \eta $ is the learning rate. The learning rate controls how large each update step is during gradient descent. If the learning rate is too small, the parameters change very slowly and training takes a long time. If the learning rate is too large, the updates can overshoot the minimum and cause the loss to increase or oscillate. We will discuss learning rate in a greater detail in the future when discussing optimization.
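As a toy illustration of this trade-off (our own sketch, separate from the running example), consider gradient descent on $f(x)=x^2$, whose analytic gradient is $2x$, with three different learning rates:
# minimize f(x) = x^2 starting from x = 3, using the gradient 2x
for lr_demo in [0.01, 0.9, 1.1]:
    xv = 3.0
    for _ in range(10):
        xv -= lr_demo * 2 * xv
    print(f"lr={lr_demo}: x after 10 steps = {xv:.5f}")
A tiny rate barely moves $x$ toward the minimum, a moderately large one overshoots and oscillates yet still converges, and a too-large one diverges.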
lr = 0.01
a.data -= lr * a.grad
b.data -= lr * b.grad
d.data -= lr * d.grad
# we skip c which is controlled by the values of a and b
# pay attention that the rest are leaf nodes in the computation graph
Note
If the loss is a negative value (not common), we would need to gradient-ascend the loss upwards towards zero by changing the update sign from -= to +=. Note that the parameter values (a, b, d) themselves can decrease or increase depending on the sign of grad.
Forward Pass¶
We will now do a single forward pass to see if the loss has decreased. Recall that the previous loss was 80.
c = a + b
L = c * d
L.data
77.376
It seems like we optimized our values and brought down the loss.
Backward Pass¶
Manually calculating gradients is good only for educational purposes. We should implement an automatic backward pass which calculates the gradients for us. To support this, we will rewrite our Value class to store a _backward function. This function applies the local gradient associated with the operation that produced the value. Initially, _backward does nothing.
For addition, the local derivatives are: $ \frac{\partial c}{\partial a} = 1 $ and $ \frac{\partial c}{\partial b} = 1 $. Hence, the gradient flowing into $c$ (stored in out.grad) is passed unchanged to both inputs:
a.grad += out.grad
b.grad += out.grad
This is why, in the _backward function for addition, both operands receive the same gradient contribution.
For multiplication, the local derivatives are: $ \frac{\partial c}{\partial a} = b $ and $ \frac{\partial c}{\partial b} = a $. During backpropagation, the incoming gradient is scaled by the opposite operand:
a.grad += b.data * out.grad
b.grad += a.data * out.grad
This reflects the chain rule: each variable’s gradient depends on how the output changes with respect to that variable. By storing a _backward function at every node and calling these functions in reverse topological order, the full gradient of the loss with respect to all intermediate values can be computed automatically.
class Value:
def __init__(self, data, _prev=(), _op='', label=''):
self.data = data
self._prev = _prev
self._op = _op
self.label = label
self.grad = 0.0
self._backward = lambda: None # initially it is a function which does nothing
def __add__(self, other):
data = self.data + other.data
out = Value(data, (self, other), '+')
        def _backward():
            # accumulate with += so a value used more than once sums its contributions
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
out._backward = _backward
return out
def __mul__(self, other):
data = self.data * other.data
out = Value(data, (self, other), "*")
        def _backward():
            # chain rule: scale the incoming gradient by the opposite operand
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
out._backward = _backward
return out
def __repr__(self):
return f"Value(data={self.data}, grad={self.grad})"
# Recreating the same function
a = Value(5, label='a')
b = Value(3, label='b')
c = a + b; c.label = 'c'
d = Value(10, label='d')
L = c * d; L.label = 'L'
draw_dot(L)
We should not forget to initialize the gradient of the loss to $1.0$ and then call the _backward functions from the output backwards. With a correct implementation, we get the same results we calculated manually before.
L.grad = 1.0
L._backward()
c._backward()
draw_dot(L)
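Calling each node's _backward by hand quickly becomes impractical for larger graphs. As a preview of the automated version described earlier, here is a minimal sketch (a standalone helper of our own, mirroring micrograd's approach) that builds the topological order and runs the _backward functions in reverse:
def backward(root):
    # collect nodes so that every node appears after all of its inputs
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)
    build(root)
    root.grad = 1.0  # derivative of the output w.r.t itself
    for node in reversed(topo):  # from the output back to the inputs
        node._backward()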
Exercise
Extend the Value class so that other operations and their partial derivatives can be handled as well (e.g. division, power).
Model Training with Backpropagation¶
We can now train the model by repeatedly performing a forward pass, a backward pass, and an optimization (gradient descent) step to reduce the loss. The backward pass, which computes gradients by propagating them through the whole computation graph, is called backpropagation.
# optimization
lr = 0.05
a.data -= lr * a.grad
b.data -= lr * b.grad
d.data -= lr * d.grad
# forward pass
c = a + b
L = c * d
# backward pass (zero the grads first, since _backward accumulates with +=)
a.grad = b.grad = d.grad = 0.0
L.grad = 1.0
L._backward()
c._backward()
L.data # loss
67.2
We have now trained the model for a single epoch. Even though what we do here is oversimplified, the main intuition and concepts behind training a neural network are the same. We will now train the model for multiple epochs until the loss drops below zero.
while True:
# optimization
a.data -= lr * a.grad
b.data -= lr * b.grad
d.data -= lr * d.grad
# forward pass
c = a + b
L = c * d
    # backward pass (zero the grads first, since _backward accumulates with +=)
    a.grad = b.grad = d.grad = 0.0
    L.grad = 1.0
L._backward()
c._backward()
if L.data < 0:
break
print(f'Loss: {round(L.data,2)}')
Loss: 55.87
Loss: 45.77
Loss: 36.68
Loss: 28.42
Loss: 20.81
Loss: 13.69
Loss: 6.91
Loss: 0.34
PyTorch Implementation¶
Tip
PyTorch is preinstalled and configured for Google Colab. For local development, PyTorch can be installed via pip install torch. If you plan to use a GPU, make sure the installed PyTorch version matches your CUDA setup.
As mentioned in the beginning, our manual implementation mirrors what is built into PyTorch. We will do a forward and backward pass with the help of the autograd engine and check that the gradients match what we previously calculated. Since gradients do not need to be computed for every tensor, PyTorch sets requires_grad to False by default to save computation, so we need to enable it for the tensors whose gradients we want.
import torch
a = torch.tensor(5.0); a.requires_grad = True
b = torch.tensor(3.0); b.requires_grad = True
c = a + b
d = torch.tensor(10.0); d.requires_grad = True
L = c * d
L.backward()
a.grad, b.grad, d.grad
(tensor(10.), tensor(10.), tensor(8.))
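Note that c is an intermediate (non-leaf) tensor, so PyTorch does not retain its gradient by default. If we also wanted c.grad, a sketch would call retain_grad() before the backward pass:
a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
d = torch.tensor(10.0, requires_grad=True)
c = a + b
c.retain_grad()  # keep the gradient of this intermediate tensor
L = c * d
L.backward()
c.grad  # tensor(10.), matching our manual calculation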
This notebook introduced the core ideas behind gradient-based learning. It began with differentiation and extended the concept to partial derivatives for multivariable functions. Computation graphs were then introduced to represent how values depend on one another and to make gradient flow explicit. Using this structure, the gradient and the chain rule were used to explain how gradients were propagated backward from the loss to earlier variables. Gradient descent was presented as the mechanism for updating parameters in the direction that reduces the loss. The roles of the forward pass and backward pass were clarified and combined to describe model training with backpropagation. Finally, these ideas were demonstrated through a PyTorch implementation using automatic differentiation. In the next notebook, a neural network will be trained using a more advanced training engine.