The following material was initially prepared as a lecture for the CSCI 4701: Deep Learning (Spring 2025) course at ADA University. The notebook is mainly based on Andrej Karpathy's lecture on Micrograd and is a continuation of my previous post, Python Code from Derivatives to Backpropagation.
# This is a graph visualization code from micrograd, no need to understand the details
# https://github.com/karpathy/micrograd/blob/master/trace_graph.ipynb
from graphviz import Digraph
def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root, format='svg', rankdir='LR'):
    """
    format: png | svg | ...
    rankdir: TB (top to bottom graph) | LR (left to right)
    """
    assert rankdir in ['LR', 'TB']
    nodes, edges = trace(root)
    dot = Digraph(format=format, graph_attr={'rankdir': rankdir}) #, node_attr={'rankdir': 'TB'})
    for n in nodes:
        dot.node(name=str(id(n)), label="{ %s | data %.3f | grad %.3f }" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            dot.node(name=str(id(n)) + n._op, label=n._op)
            dot.edge(str(id(n)) + n._op, str(id(n)))
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
Recall: Backpropagation
# This is the code from previous lecture
# It is important to understand _backward function
class Value:
    def __init__(self, data, _prev=(), _op='', label=''):
        self.data = data
        self._prev = _prev
        self._op = _op
        self.label = label
        self._backward = lambda: None
        self.grad = 0.0

    def __add__(self, other):
        data = self.data + other.data
        out = Value(data, (self, other), '+')
        def _backward():
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        data = self.data * other.data
        out = Value(data, (self, other), '*')
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        return self + (Value(-1) * other)  # self + (-other)

    def __repr__(self):
        return f'{self.label}: {self.data}'
a = Value(5, label='a')
b = Value(3, label='b')
c = a + b; c.label = 'c'
d = Value(10, label='d')
L = c * d; L.label = 'L'
draw_dot(L)
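Before training, we can run one backward pass by hand as a sanity check (this snippet is my addition, not part of the lecture code): since L = (a + b) * d with a = 5, b = 3, d = 10, we expect dL/dc = d = 10, dL/da = dL/db = 10 and dL/dd = c = 8.
L.grad = 1.0
L._backward()   # fills in c.grad and d.grad
c._backward()   # fills in a.grad and b.grad
print(a.grad, b.grad, c.grad, d.grad)  # 10.0 10.0 10.0 8.0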
epochs = 10
learning_rate = 0.01

for _ in range(epochs):
    L.grad = 1.0
    # backward pass
    L._backward()
    c._backward()
    # optimization (gradient descent)
    a.data -= learning_rate * a.grad
    b.data -= learning_rate * b.grad
    d.data -= learning_rate * d.grad
    # forward pass
    c = a + b
    L = c * d
    print(f'Loss: {L.data:.2f}')
# output
Loss: 77.38
Loss: 74.81
Loss: 72.31
Loss: 69.87
Loss: 67.49
Loss: 65.16
Loss: 62.88
Loss: 60.66
Loss: 58.48
Loss: 56.35
# Equivalent implementation in PyTorch
# pay attention to requires_grad, no_grad() and zero_()
import torch
a = torch.tensor(5.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a + b
d = torch.tensor(10.0, requires_grad=True)
L = c * d

for _ in range(epochs):
    # backward pass
    L.backward()
    # optimization (gradient descent)
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        d -= learning_rate * d.grad
        # avoids accumulating gradients
        # comment this out to see how it affects the learning
        a.grad.zero_()
        b.grad.zero_()
        d.grad.zero_()
    # forward pass
    c = a + b
    L = c * d
    print(f'Loss: {L.data:.2f}')
# output
Loss: 77.38
Loss: 74.81
Loss: 72.31
Loss: 69.87
Loss: 67.49
Loss: 65.16
Loss: 62.88
Loss: 60.66
Loss: 58.48
Loss: 56.35
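As a small aside of my own (not part of the original lecture): without zero_(), PyTorch keeps adding new gradients on top of the old ones, which is easy to verify on a toy tensor.
w = torch.tensor(2.0, requires_grad=True)
(w * 3).backward()
print(w.grad)   # tensor(3.)
(w * 3).backward()
print(w.grad)   # tensor(6.) -- the second backward call accumulated into w.grad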
Activation Function
The function f(x) = x * w is a linear function that always passes through the origin. Real-world data, however, is much more complex, and in order to describe a pattern in the data our machine learning model should return a more flexible function. For that, we will do two things: add a bias b and bring in non-linearity with an activation function. We can choose among different non-linear activation functions, with the condition that the function must be differentiable (otherwise we won't be able to calculate gradients for backpropagation). We will implement the sigmoid (logistic) activation function, which has the following formula:
sigmoid(x) = 1 / (1 + e^(-x))
The sigmoid function not only makes a linear function non-linear and continuous, but also maps any value of x to be between 0 and 1. It may be useful when we want to predict probabilities for different output classes.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))  # see formula above

# a simple linear function where activation will be applied
def f(x, w=0.5, b=10, activation=None):
    out = x * w + b
    return activation(out) if activation else out

def plot(f, x, activation=None):
    plt.figure(figsize=(4, 4))
    x_all = np.linspace(-50, 50, 100)
    y_all = f(x_all, activation=activation)
    plt.plot(x_all, y_all)
    plt.scatter(x, f(x, activation=activation), color='r')
    plt.show()
x = -20
plot(f, x)
Now we will plot the exact same point mapped by the non-linear function to a value between 0 and 1. Try out different x values and see the plots.
plot(f, x, sigmoid)
Exercise: Implement other activation functions (e.g. tanh, relu) and see the plot.
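If you want to attempt this exercise, a minimal sketch of my own could look like the following, reusing the plot helper defined above:
def tanh(x):
    # squashes x into the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # zero for negative inputs, identity for positive ones
    return np.maximum(0, x)

plot(f, x, tanh)
plot(f, x, relu)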
Exercise: What could be the disadvantage of using the sigmoid activation function?
Artificial Neuron
An artificial neuron is simply a linear function passed through an activation function (e.g. sigmoid(x * w + b)). The illustration above describes an N-dimensional neuron, accepting inputs x1 ... xn. The function f we had above is a very simple neuron with a 1-dimensional input.
Question: What could be the input values for predicting the probability of a customer cancelling their subscription?
Before creating our neuron, we will first make some updates to the Value class. Not only should the Value class have a sigmoid(x) function, but it should also be able to calculate the derivative of it.
Exercise: Find the derivative of the sigmoid function.
The requires_grad flag (similar to PyTorch) will tell which parameters are trainable and require gradient calculation and updates (note that this feature is not implemented in micrograd). For example, it doesn't make sense to modify the real-life training inputs x1 and x2 of our neuron, so we shouldn't spend resources on calculating unnecessary gradients. Our goal is to nudge only the weight and bias (i.e. parameter) values, as well as the nodes that depend on them, in order to minimize the eventual loss.
class Value:
    def __init__(self, data, _prev=(), _op='', requires_grad=False, label=''):
        self.data = data
        self._prev = _prev
        self._op = _op
        self.label = label
        self._backward = lambda: None
        self.grad = 0.0
        self.requires_grad = requires_grad

    def __add__(self, other):
        data = self.data + other.data
        out = Value(data, (self, other), '+', self.requires_grad or other.requires_grad)
        def _backward():
            if self.requires_grad:
                self.grad = 1.0 * out.grad
            if other.requires_grad:
                other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        data = self.data * other.data
        out = Value(data, (self, other), '*', self.requires_grad or other.requires_grad)
        def _backward():
            if self.requires_grad:
                self.grad = other.data * out.grad
            if other.requires_grad:
                other.grad = self.data * out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        return self + (Value(-1) * other)  # self + (-other)

    def sigmoid(self):
        s = 1.0 / (1 + np.exp(-self.data))
        out = Value(s, (self, ), 'sigmoid', self.requires_grad)
        def _backward():
            if self.requires_grad:
                self.grad = s * (1.0 - s) * out.grad
        out._backward = _backward
        return out

    def __repr__(self):
        return f'Value({self.data:.4f})'
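The sigmoid method's _backward uses the well-known identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). As a quick sanity check of my own, we can compare it against a finite-difference estimate:
x, h = 0.7, 1e-6
s = 1.0 / (1 + np.exp(-x))
analytical = s * (1.0 - s)                               # s * (1 - s)
numerical = (1.0 / (1 + np.exp(-(x + h))) - s) / h       # (sigmoid(x + h) - sigmoid(x)) / h
print(analytical, numerical)  # both approximately 0.2217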
We will initially implement a simple Neuron class in 3D (2-dimensional input values and an output value). The function will have two inputs x1 and x2, which will become Value objects. Their weights w1 and w2 will determine how much each input (e.g. the age of a customer) influences the outcome.
from mpl_toolkits.mplot3d import Axes3D
class Neuron:
    def __init__(self):
        self.w1 = Value(np.random.uniform(-1, 1), label='w1', requires_grad=True)
        self.w2 = Value(np.random.uniform(-1, 1), label='w2', requires_grad=True)
        self.b = Value(0, label='b', requires_grad=True)

    def __call__(self, x1, x2):
        out = x1 * self.w1 + x2 * self.w2 + self.b
        return out.sigmoid()

    # this code here is for plotting, no need to understand, works for only 3D
    def plot(self):
        x1_vals = np.linspace(-5, 5, 100)
        x2_vals = np.linspace(-5, 5, 100)
        X1, X2 = np.meshgrid(x1_vals, x2_vals)
        Z = np.zeros_like(X1)
        for i in range(X1.shape[0]):
            for j in range(X1.shape[1]):
                x1 = Value(X1[i, j])
                x2 = Value(X2[i, j])
                output = self(x1, x2)
                Z[i, j] = output.data
        fig = plt.figure(figsize=(10, 8))
        ax = fig.add_subplot(111, projection='3d')
        ax.plot_surface(X1, X2, Z, cmap='viridis')
        ax.set_title('Neuron Output with Sigmoid Activation')
        plt.show()
Now we can initialize our inputs and neuron to see our computation graph. Our loss will be simple: the ground truth label y minus the predicted probability. Let's assume that our input values x1 and x2 correspond to a customer who made the purchase (y = 1). We will draw the computation graph and plot the neuron's output surface.
x1 = Value(2, label='x1')
x2 = Value(3, label='x2')
y = Value(1, label= 'y')
n = Neuron()
pred = n(x1, x2); pred.label = 'pred'
L = y - pred; L.label = 'loss'
draw_dot(L)
n.plot()
The ground truth label 1 tells us that we should push our probability towards 1.0. In other words, as our loss L here is a simple error value corresponding to 1 - prob, we should try to minimize the loss down to zero with backpropagation. However, our computation graph is bigger than it was before, and calling the _backward function manually on each node does not scale. Ideally, we should have a single backward() function to calculate all the gradients, as we previously saw in the PyTorch implementation. For that, we will need to sort the nodes of the computation graph (in this case, from the input/weight nodes up to the probability node). We can achieve that with the topological sort function implemented for micrograd.
topo = []
visited = set()

def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)
build_topo(pred)
topo
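To make the ordering easier to read, we can print just the node labels and operations (a small addition of mine): the leaves (inputs, weights, bias) should come first and pred last.
# leaf nodes carry labels, intermediate nodes carry their operation symbol
print([v.label or v._op for v in topo])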
We will integrate topological sort into our Value object and implement a complete backward pass. We can also add a simple gradient descent function, optimize(), which will use this topology. Finally, instead of overriding gradients (=), we will accumulate them (+=) to avoid gradient update bugs when the same node is used more than once in an operation. As a consequence, we will have to reset gradients with zero_() (similar to PyTorch) so that the gradients of different backward passes do not affect each other (it does the exact same thing as self.grad = 0.0 was doing before gradient accumulation). To be precise, the zero_() function resets only the gradient of self; in PyTorch it is actually the optimizer's zero_grad() function that resets gradients across all nodes.
class Value:
    def __init__(self, data, _prev=(), _op='', requires_grad=False, label=''):
        self.data = data
        self._prev = _prev
        self._op = _op
        self.label = label
        self._backward = lambda: None
        self.grad = 0.0
        self.requires_grad = requires_grad
        self.topo = self.build_topo()
        self.params = [node for node in self.topo if node.requires_grad]

    def __add__(self, other):
        data = self.data + other.data
        out = Value(data, (self, other), '+', self.requires_grad or other.requires_grad)
        def _backward():
            if self.requires_grad:
                self.grad += 1.0 * out.grad
            if other.requires_grad:
                other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        data = self.data * other.data
        out = Value(data, (self, other), '*', self.requires_grad or other.requires_grad)
        def _backward():
            if self.requires_grad:
                self.grad += other.data * out.grad
            if other.requires_grad:
                other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        return self + (Value(-1) * other)  # self + (-other)

    def sigmoid(self):
        s = 1.0 / (1 + np.exp(-self.data))
        out = Value(s, (self, ), 'sigmoid', self.requires_grad)
        def _backward():
            if self.requires_grad:
                self.grad += s * (1.0 - s) * out.grad
        out._backward = _backward
        return out

    def build_topo(self):
        # topological order all of the children in the graph
        topo = []
        visited = set()
        def _build_topo(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    _build_topo(child)
                topo.append(node)
        _build_topo(self)
        return topo

    def backward(self):
        if self.requires_grad:
            self.grad = 1.0
        for node in reversed(self.params):
            node._backward()

    def optimize(self, learning_rate=0.01):
        for node in self.params:
            node.data -= learning_rate * node.grad

    def zero_(self):
        self.grad = 0.0

    def zero_grad(self):
        for node in self.params:
            node.grad = 0.0

    def __repr__(self):
        return f'Value({self.data})'
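To see why accumulating with += matters, here is a tiny check of my own (not from the lecture) that uses the same node twice in a single operation:
a = Value(3.0, requires_grad=True, label='a')
b = a + a          # the same node appears twice in one operation
b.backward()
print(a.grad)      # 2.0 with '+=' accumulation; the old '=' version would have reported 1.0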
In addition to the _backward() function, which calculated the derivatives only for the immediate previous nodes, we now have the backward() function, which calculates the derivatives for all the nodes (we also won't forget to set the gradient to 1.0 in the beginning). Once we plot the graph, notice that the gradients of the input and leaf nodes we have no control over are not calculated, thanks to requires_grad.
x1 = Value(2, label='x1')
x2 = Value(5, label='x2')
y = Value(1, label= 'y')
n = Neuron()
pred = n(x1, x2); pred.label = 'pred'
L = y - pred; L.label = 'loss'
L.backward()
draw_dot(L)
We can finally run complete backpropagation with the goal of increasing the final probability towards 1.0 (decreasing the loss down to zero).
# gradient descent
L.optimize()
# forward pass
pred = n(x1, x2); pred.label = 'pred'
L = y - pred; L.label = 'loss'
draw_dot(L)
Let's repeat the backpropagation over multiple epochs until we achieve a minimal loss. We will also print the parameters to see when our neuron function returns a maximum probability for the given input values. And we will make sure not to forget to reset the gradients.
while True:
    L.zero_grad()
    # backward pass
    L.backward()
    # gradient descent
    L.optimize()
    # forward pass
    pred = n(x1, x2)
    L = y - pred
    print(f'Loss {L.data:.4f}')
    if L.data < 0.01:
        print(f'\nInputs: {x1} {x2}')
        print(f'Parameters: {n.w1} {n.w2} {n.b}')
        print(f'Prediction Probability: {pred.data}')
        break
# output
Loss 0.0148
Loss 0.0148
Loss 0.0147
...
Loss 0.0101
Loss 0.0100
Inputs: Value(2) Value(5)
Parameters: Value(0.15289672859342332) Value(0.8555124993367387) Value(0.013701999952082168)
Prediction Probability: 0.9900191690162561
N-dimensional Neuron
We have just trained our 2-dimensional input neuron to find suitable parameter values for achieving a maximum probability for the input values x1 and x2. Now we would like to create an N-dimensional neuron which will accept many more inputs, similar to what we saw in the illustration of the artificial neuron: x1 ... xn. As a consequence, our neuron will have to learn the parameter values for N-dimensional weights w1 ... wn.
class Neuron:
    def __init__(self, N):
        self.W = [Value(np.random.uniform(-1, 1), label=f'w{i}', requires_grad=True) for i in range(N)]
        self.b = Value(0, label='b', requires_grad=True)

    def __call__(self, X):
        out = sum((x * w for x, w in zip(X, self.W)), self.b)
        return out.sigmoid()
We will now see the training output of our N-dimensional neuron, which accepts N Value inputs as a list. Note that our Neuron, which uses the sigmoid (logistic) activation, is known as Logistic Regression.
X = [Value(x, label=f'x{i}') for i, x in enumerate([5, 0.4, -1, -2])]
n = Neuron(len(X))
pred = n(X); pred.label = 'pred'
L = y - pred; L.label = 'loss'
draw_dot(L)
while True:
    L.zero_grad()
    # backward pass
    L.backward()
    # gradient descent
    L.optimize()
    # forward pass
    pred = n(X)
    L = y - pred
    print(f'Loss {L.data:.4f}')
    if L.data < 0.01:
        print(f'\nInputs: {X}')
        print(f'Parameters: {n.W} {n.b}')
        print(f'Prediction Probability: {pred.data}')
        break
# output
Loss 0.1273
Loss 0.1235
Loss 0.1199
...
Loss 0.0101
Loss 0.0100
Inputs: [Value(5), Value(0.4), Value(-1), Value(-2)]
Parameters: [Value(0.6195076510193583), Value(0.988462556700058), Value(0.6814715991439656), Value(-0.8498024629867303)] Value(0.08692061974641961)
Prediction Probability: 0.9900282484345291
Artificial Neural Network
We managed to train a single neuron to learn a function for our input values. In reality, however, data is much more complex and we need to learn more complicated functions. How do we achieve that? By chaining many neurons together, similar to biological neurons. Each neuron will basically learn some portion of the overall function.
What we see above is an illustration of an artificial neural network. In the input layer we have three neurons, each separately accepting the N-dimensional input values. The output values of each neuron are then fully connected, as inputs, to the hidden layer with four neurons (note that there can be more than one hidden layer). Finally, the outputs of the hidden layer neurons are passed as inputs to the output layer, which may, for example, predict probability scores for two classes.
We will now implement a fully connected feedforward neural network, which is often referred to as a Multi-Layer Perceptron (MLP).
class Layer:
    def __init__(self, N, count):
        self.neurons = [Neuron(N) for _ in range(count)]

    def __call__(self, X):
        outs = [n(X) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs  # flattening dimension if a single element
The code above creates a list of count neurons, each accepting an N-dimensional input. Let's build the layers shown in the illustration above and connect them. Note that the input dimension of the next layer is the number of neurons in the previous layer.
# input data and its dimension
X = [Value(x, label=f'x{i}') for i, x in enumerate([1, 4, -3, -2, 3])]
N = len(X)
# creating layers
in_layer = Layer(N, 3)
hid_layer = Layer(3, 4)
out_layer = Layer(4, 2)
# output of each layer is input to the next
X_hidden = in_layer(X)
X_output = hid_layer(X_hidden)
out = out_layer(X_output)
# let's plot either one of the outputs
draw_dot(out[0])
We will further abstract away the neuron and layer creation inside an MLP class, and then reimplement the exact same network.
class MLP:
    def __init__(self, N, counts):
        dims = [N] + counts  # concatenates dimensions
        self.layers = [Layer(dims[i], dims[i+1]) for i in range(len(dims)-1)]

    def __call__(self, X):
        out = X
        for layer in self.layers:
            out = layer(out)
        return out
nn = MLP(N, [3, 4, 2])
out = nn(X)
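As a small sanity check of my own, we can count the trainable parameters reachable from one of the outputs. The leaf parameters are the Value objects with requires_grad=True and no previous nodes; out[0] sees all of the first two layers but only one of the two output neurons, so we expect 3*(5+1) + 4*(3+1) + (4+1) = 39.
# count the trainable leaf parameters (weights and biases) in out[0]'s graph
leaf_params = [p for p in out[0].params if not p._prev]
print(len(leaf_params))  # expected: 18 + 16 + 5 = 39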
Iris Dataset
It is time to judge our network by applying it to a real dataset. Even though applying neural networks to Fisher's Iris dataset is a little overkill (the dataset is simple), it will be a nice demonstration of our MLP's capacity.
The Iris dataset has 4-dimensional input samples with three possible output classes. We should be able to predict the class of an Iris flower based on the length and width of its sepals and petals. We will load the dataset and split it into train and test sets.
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data # 150 samples (50 per class), each 4-dimensional
y = iris.target # 3 classes (0, 1, 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(f'Train data shape: {X_train.shape}, {y_train.shape}')
print(f'Test data shape: {X_test.shape}, {y_test.shape}')
print(f'Input Samples:\n {X_train[:5]}')
print(f'Labels:\n {y_train[:5]}')
Train data shape: (120, 4), (120,)
Test data shape: (30, 4), (30,)
Input Samples:
[[4.4 2.9 1.4 0.2]
[4.7 3.2 1.3 0.2]
[6.5 3. 5.5 1.8]
[6.4 3.1 5.5 1.8]
[6.3 2.5 5. 1.9]]
Labels:
[0 0 2 2 2]
We should then convert each element into a Value object. But before that, let's try out two ready-made scikit-learn classifiers, LogisticRegression and MLPClassifier, the latter of which can be seen as an extension of the former and is what we will implement in its simplest form. As we have already discussed, LogisticRegression is basically our Neuron class which uses the sigmoid (logistic) function as its activation. In fact, Logistic Regression will simply be our MLP with a single layer containing a single neuron. Thanks to numpy vectorization and other optimizations, the sklearn implementations will be extremely quick.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print(f"Logistic Regression Accuracy: {accuracy:.2f}")
model = MLPClassifier()
model.fit(X_train, y_train)
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print(f"MLP Classifier Accuracy: {accuracy:.2f}")
# output
Logistic Regression Accuracy: 1.00
MLP Classifier Accuracy: 1.00
When converting the labels to Value objects, we can technically scale them down to be between 0 and 1 (by multiplying them by 0.5). Then we can apply the simple Mean Squared Error (MSE). Because there are only three labels and the dataset is small, this will work, yet it treats the labels as ordinal and penalizes them unnecessarily. Our loss should give the same penalty whether we predict 0 instead of 1 or 0 instead of 2 (it is either one type of flower or another). MSE, however, will give more penalty when predicting 0 instead of 2 and less penalty when predicting 0 instead of 1, because of the imaginary distances between the numbers, which do not exist among the real labels.
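To make the point concrete, here is a tiny illustration of my own on the unscaled labels:
# MSE treats the class labels as ordered quantities:
# predicting 0 when the true class is 2 is penalized four times more heavily
# than predicting 0 when the true class is 1, even though both are simply wrong
print((0 - 1) ** 2, (0 - 2) ** 2)  # 1 4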
# converting numpy float arrays into Value lists
# scaling labels to be between 0 and 1 to simplify our code
# again, this is not the right way to treat the labels
X_train = [[Value(x) for x in X] for X in X_train]
X_test = [[Value(x) for x in X] for X in X_test]
y_train = [Value(y) * Value(0.5) for y in y_train]
y_test = [Value(y) * Value(0.5) for y in y_test]
Training Custom MLP Classifier
We will now create a class for our model in the style of scikit-learn models, more specifically MLPClassifier.
As noted above, what we do is simple and will work in this case, but it is not the right approach. Ideally, we should first one-hot encode the ground truth labels to describe them with only zeros and ones. Our final layer for multiclass classification, instead of a sigmoid activation, should output logits (unprocessed predictions) that are passed through a softmax function, and the loss in this case should be Categorical Cross Entropy instead of MSE.
Exercise (Advanced): Write code for the correct implementation noted above. See Josh Starmer's (StatQuest) videos which explain the theory, including the Iris dataset, the argmax and softmax functions, as well as Cross Entropy.
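For the exercise, a minimal numpy sketch of the two missing pieces might look like this (softmax and cross entropy on plain floats; wiring them into our Value graph, e.g. by adding exp and log operations, is left to you):
def softmax(logits):
    # subtracting the max keeps the exponentials numerically stable
    exps = np.exp(np.array(logits) - np.max(logits))
    return exps / exps.sum()

def cross_entropy(probs, one_hot):
    # penalizes only the predicted probability of the correct class
    return -np.sum(np.array(one_hot) * np.log(np.array(probs) + 1e-12))

probs = softmax([2.0, 0.5, -1.0])       # hypothetical logits for the 3 Iris classes
print(probs, cross_entropy(probs, [1, 0, 0]))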
class Classifier:
    def __init__(self, layer_sizes=[2, 3, 1]):
        self.layer_sizes = layer_sizes
        self.nn = None
        self.L = None
        self.iterations = 0

    def forward(self, Xs):
        out = [self.nn(X) for X in Xs]
        return out

    def predict(self, X_test):
        return self.forward(X_test)

    def train(self, X_train, y_train, learning_rate=0.01):
        preds = self.forward(X_train)
        self.L = self.mean_squared_error(y_train, preds)
        self.L.zero_grad()
        self.L.backward()
        self.L.optimize(learning_rate=learning_rate)
        print(f'Loss: {self.L.data:.4f}')

    def fit(self, X_train, y_train, learning_rate=0.01, num_epochs=50):
        if not self.nn:  # in order to not restart training if nn exists
            self.nn = MLP(len(X_train[0]), self.layer_sizes)  # input dimension is the number of features
        for i in range(num_epochs):
            print(f'Training epoch {self.iterations + i + 1}')
            self.train(X_train, y_train, learning_rate)
        self.iterations += num_epochs

    def mean_squared_error(self, y_train, preds):
        return sum([(y - y_hat) * (y - y_hat) for y, y_hat in zip(y_train, preds)], Value(0))

    def score(self, y_test, preds):
        return self.mean_squared_error(y_test, preds).data / len(y_test)

    def accuracy_score(self, y_test, preds):
        # due to incorrect handling of the labels
        # we need to scale y_test and preds values back
        y_test = [y * Value(2) for y in y_test]
        preds = [y_hat * Value(2) for y_hat in preds]
        correct = sum(1 for y, y_hat in zip(y_test, preds) if round(y_hat.data) == y.data)
        total = len(y_test)
        return (correct / total)
Our naive classifier is ready and we can now train the model and note the accuracy. However, unlike the optimized classifiers of the sklearn library, it will be much slower and less efficient. Experiment with different layer sizes and learning rates, and notice how they affect the training process and the loss. As we mentioned before, we can implement Logistic Regression by simply passing layer_sizes=[1] to our MLP classifier.
model = Classifier([1])
# model = Classifier([4, 1])
model.fit(X_train, y_train, learning_rate=0.002, num_epochs=90)
# output
Training epoch 1
Loss: 54.7921
...
Loss: 1.3718
Training epoch 88
Loss: 1.3710
Training epoch 89
Loss: 1.3701
Training epoch 90
Loss: 1.3693
preds = model.predict(X_train)
print(f'Custom MLP classifier accuracy on train Data: {model.accuracy_score(y_train, preds):.2f}')
# output
Custom MLP classifier accuracy on train Data: 0.97
preds = model.predict(X_test)
print(f'Custom MLP classifier accuracy on test Data: {model.accuracy_score(y_test, preds):.2f}')
# output
Custom MLP classifier accuracy on test Data: 0.97
for i in range(10):
    print(y_test[i].data * 2, preds[i].data * 2)
# output
2.0 1.7378279677947062
1.0 1.3786491598121862
2.0 1.7797233124612888
2.0 1.5546211532302696
1.0 0.7808593322180695
1.0 1.1323777346177633
0.0 0.03179964376674919
2.0 1.6900839613983492
2.0 1.818651271834627
2.0 1.5712929781499119
2.0 1.9175600703929185