05. Convolutional Neural Network Architectures¶
Important
The notebook is currently under revision.
Info
The following source was consulted in preparing this material: Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. Dive into Deep Learning. Cambridge University Press. Chapter 8: Modern Convolutional Neural Networks.
For about a decade convolutional architectures dominated computer vision. Around 2020, however, a major shift occurred with the introduction of Vision Transformers (ViT). Follow-up work demonstrated that attention-based models could match or exceed the performance of many convolutional networks when trained on large datasets, which decreased the popularity of convolutional architectures. Despite this, architectures like ConvNeXt (Liu et al., 2022) revisited the CNN design with modern choices such as larger depthwise convolution kernels, GELU activations in place of ReLU, simplified residual blocks, adjusted placement of normalization layers, and training practices that had become common in transformer models. They showed that much of the performance gap came from differences in training practices rather than architectural limits. With these changes, convolutional networks were able to achieve results close to transformer-based models on ImageNet.
Today both approaches are widely used. Transformers are common in large-scale vision systems and multimodal models, while convolutional networks remain important due to their efficiency and simplicity, especially in applications where computation and memory are limited. Moreover, many ideas used in modern architectures, such as hierarchical feature extraction, residual connections, etc. were first developed in convolutional networks.
In previous notebooks we introduced basic convolutional neural networks such as LeNet and discussed optimization challenges like vanishing gradients. In this notebook we will discuss influential neural network architectures and their core ideas.
ImageNet Dataset¶
We discussed in our introduction that progress in deep learning was limited by the lack of large labeled datasets. A major turning point came with the introduction of the ImageNet (Deng et al., 2009) dataset. The dataset was organized according to the WordNet hierarchy, a large lexical database of English nouns. The idea was to collect thousands of real-world images illustrating a wide range of classes so that machine learning systems could learn visual concepts at scale. Images were gathered from search engines and then verified by crowdsourced human annotators on Amazon Mechanical Turk, with each image checked multiple times. The full ImageNet dataset contains more than 14 million images of varying resolution (commonly resized or cropped to $224 \times 224$ during training), covering over 20,000 categories, and took about three years and (perhaps) a couple of hundred thousand dollars to build.
Most deep learning research focuses on a subset introduced for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) benchmark, which was launched to evaluate algorithms for image classification and object detection. The classification task uses 1000 object categories, each with roughly 1000 training images. This results in approximately 1.2 million training images, along with a validation set of about 50,000 images and a separate test set used for leaderboard evaluation.
AlexNet¶
AlexNet (Krizhevsky et al., 2012) won ILSVRC 2012 by a large margin. Architecturally it was very close to LeNet: a sequence of convolutional and pooling layers followed by fully connected layers for classification. The main reasons it succeeded where earlier networks had not were, in essence, much more data, better hardware, and a few critical design and training choices.
The increase in data and resolution made it possible (and necessary) to use a larger, deeper network in AlexNet. LeNet used sigmoid (and tanh) activations, which saturate for large positive or negative inputs, so gradients shrink and learning slows. AlexNet switched to ReLU, which does not saturate for $x > 0$, so gradients flow more easily and training of deeper nets became feasible. This change alone had a large impact on convergence speed and final accuracy.
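The saturation effect is easy to check numerically. In this short sketch we compare the gradients of sigmoid and ReLU at a moderately large input magnitude:

```python
import torch

# Compare gradients at |x| = 5, where sigmoid saturates but ReLU does not.
x = torch.tensor([5.0, -5.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
sigmoid_grad = x.grad.clone()   # about 0.0066 at both points: near-zero signal

x.grad = None
torch.relu(x).sum().backward()
relu_grad = x.grad.clone()      # 1 for x > 0, 0 for x < 0

print(sigmoid_grad)  # tensor([0.0066, 0.0066])
print(relu_grad)     # tensor([1., 0.])
```

Stacking many sigmoid layers multiplies these tiny factors together, which is exactly the vanishing-gradient problem that ReLU sidesteps for positive inputs.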
Training AlexNet on a single GPU would have been extremely slow and difficult due to memory constraints. The GPUs available at the time (GTX 580 with 3GB memory) could not comfortably store the entire network and its intermediate activations, especially because the model contained very large fully connected layers. To address this, the authors split the network across two GPUs. Each GPU processed a subset of feature maps in many layers, and the two streams communicated only at specific points in the network.
Note
The C++/CUDA codebase written by Alex Krizhevsky (cuda-convnet) was used for several subsequent years to train major CNN models. Today, deep learning frameworks handle GPU parallelism automatically. We usually define a single model and the framework distributes computation across devices using techniques such as data parallelism. Manually splitting a network across GPUs, as done in AlexNet, is rarely necessary today.
AlexNet made heavy use of dropout in the fully connected layers to reduce overfitting, and data augmentation to effectively enlarge the training set.
AlexNet also replaced the average pooling with max pooling. Max pooling outputs the largest activation in the window, preserving the strongest detected feature rather than averaging responses together. This helps the network retain the most salient signals during downsampling. AlexNet also introduced overlapping pooling, using $3 \times 3$ pooling windows with stride $2$. Because the windows overlap, adjacent pooling regions share some pixels. The authors reported that this design slightly reduced overfitting and improved classification performance.
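A small sketch illustrating both points: max pooling keeps the strongest activation where average pooling smears it out, and an AlexNet-style $3 \times 3$ window with stride $2$ makes adjacent pooling regions overlap:

```python
import torch
import torch.nn as nn

# Max pooling preserves the strongest feature; average pooling dilutes it.
x = torch.tensor([[[[0., 0.], [0., 9.]]]])
max_out = nn.MaxPool2d(2)(x)
avg_out = nn.AvgPool2d(2)(x)
print(max_out)  # tensor([[[[9.]]]])
print(avg_out)  # tensor([[[[2.2500]]]])

# Overlapping pooling as in AlexNet: 3x3 windows with stride 2, so
# neighboring windows share one row/column of pixels.
pool = nn.MaxPool2d(kernel_size=3, stride=2)
y = pool(torch.randn(1, 1, 7, 7))
print(y.shape)  # torch.Size([1, 1, 3, 3]); floor((7 - 3) / 2) + 1 = 3
```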
Note
Some ideas introduced in the paper did not remain widely used. For example, local response normalization (LRN) has largely been replaced by batch normalization, which stabilizes training more effectively.
VGGNet¶
VGG (Simonyan and Zisserman, 2015) replaced the mix of kernel sizes in AlexNet with a simple, repetitive design: modular blocks. A VGG block is a small stack of $3 \times 3$ conv–ReLU layers (often two or three) followed by one $2 \times 2$ max pooling layer. The whole network is a sequence of such blocks, with channel counts increasing (e.g. 64, 128, 256, 512). This uniform structure made it easy to scale depth (VGG11, VGG16, VGG19) and influenced later designs.
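The block structure described above can be sketched as a small helper function; the channel counts in the usage example are illustrative:

```python
import torch
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """num_convs 3x3 conv-ReLU layers followed by one 2x2 max pooling layer."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Two conv-ReLU layers, 64 -> 128 channels, halving the spatial size:
block = vgg_block(2, 64, 128)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 16, 16])
```

A full VGG network is then just a chain of such blocks with growing channel counts, followed by the classifier head.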
Note
Simply stacking more VGG blocks eventually led to training difficulties (beyond a depth of 19 layers VGG showed diminishing returns), which residual connections and batch normalization later addressed.
To understand the architecture, it is important to introduce one more concept. The receptive field of a neuron is the set of input pixels that can influence its value. A single $5 \times 5$ convolution gives each output pixel a receptive field of $5 \times 5$: it sees 25 input pixels.
It turns out, two $3 \times 3$ convolutions in sequence also yield a receptive field of $5 \times 5$: the first $3 \times 3$ sees 9 pixels, and the second $3 \times 3$ sees 9 output values of the first layer, which themselves each see 9 input pixels. So one $5 \times 5$ and two $3 \times 3$ layers have the same receptive field.
With the same receptive field, two $3 \times 3$ layers are strictly more expressive: they insert an extra nonlinearity between the two convolutions, so the function is a composition of two linear plus nonlinear steps instead of one. Hence, the representational power of two kernels is higher.
At the same time they use fewer parameters. For $C$ input and output channels, a single $5 \times 5$ layer has $25 C^2$ parameters. Two $3 \times 3$ layers have $9 C^2 + 9 C^2 = 18 C^2$ parameters — about 30% fewer — while covering the same spatial context and allowing more flexible, hierarchical features. The same reasoning extends to three $3 \times 3$ layers replacing one $7 \times 7$: same receptive field, more non-linearities, fewer parameters.
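The parameter counts can be verified directly. This sketch uses $C = 64$ and omits bias terms for a clean comparison:

```python
import torch.nn as nn

C = 64
one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
two_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),  # extra non-linearity between the two convolutions
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(one_5x5))  # 102400 = 25 * 64^2
print(n_params(two_3x3))  # 73728  = 18 * 64^2, about 28% fewer
```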
Network in Network¶
Network in Network (NiN) (Lin et al., 2014) introduced two major ideas that became standard.
At first glance, a kernel of size $1 \times 1$ seems meaningless, as it does not combine information across height or width — only across channels. However, at each spatial position $(h, w)$, the layer takes the $C_{\text{in}}$ channel values and computes $C_{\text{out}}$ linear combinations of them. So it behaves like a small fully connected layer applied independently at every pixel. That gives the network a way to mix and compress channels (e.g. reduce 256 channels to 64) or add extra non-linear capacity without changing spatial resolution.
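This per-pixel fully connected behavior can be demonstrated directly: a $1 \times 1$ convolution compressing 256 channels to 64 produces exactly the same values as a linear layer with the same weights applied at every spatial position.

```python
import torch
import torch.nn as nn

# A 1x1 convolution mixes channels at every pixel: here it compresses 256 -> 64.
reduce = nn.Conv2d(256, 64, kernel_size=1)
x = torch.randn(8, 256, 14, 14)
y = reduce(x)
print(y.shape)  # torch.Size([8, 64, 14, 14]); spatial size is unchanged

# Equivalent: one fully connected layer applied independently at each position.
fc = nn.Linear(256, 64)
fc.weight.data = reduce.weight.data.view(64, 256)  # drop the 1x1 spatial dims
fc.bias.data = reduce.bias.data
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(y, y_fc, atol=1e-5))  # True
```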
After the last convolutional layers we usually have a tensor of shape (batch, channels, height, width). Classical designs flatten this and feed it into one or more large fully connected layers to produce class scores. Those dense layers contain most of the parameters and are prone to overfitting. The global average pooling (GAP) layer introduced by NiN replaces that with a single step: average each channel over all spatial positions. So for each of the $C$ channels we compute one number (the mean over $H \times W$). The result is a vector of length $C$, which can be fed directly into the classifier (or a single linear layer). No flattening, no huge matrices to keep in memory, just one number per channel. This drastically cuts parameters, reduces overfitting, and makes the output invariant to the spatial size of the feature map, so the same network can handle slightly different input resolutions. Since NiN, GAP has been common in modern CNNs.
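In PyTorch, global average pooling is a one-liner: a mean over the spatial dimensions, or equivalently `nn.AdaptiveAvgPool2d(1)`. The shapes below are illustrative:

```python
import torch
import torch.nn as nn

# Global average pooling: one number per channel, averaged over H x W.
A = torch.randn(32, 512, 7, 7)   # (batch, channels, H, W)
gap = A.mean(dim=(2, 3))         # collapse spatial dims
print(gap.shape)  # torch.Size([32, 512])

# nn.AdaptiveAvgPool2d(1) computes the same per-channel averages:
gap2 = nn.AdaptiveAvgPool2d(1)(A).flatten(1)
print(torch.allclose(gap, gap2, atol=1e-6))  # True

# The output length does not depend on the spatial size of the feature map:
print(torch.randn(32, 512, 9, 9).mean(dim=(2, 3)).shape)  # torch.Size([32, 512])
```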
Inception (GoogLeNet)¶
Inception (Szegedy et al., 2014) introduced the Inception block. (It was named Inception because of the internet meme from the infamous Inception movie; if you don't believe this, scroll down to the paper's references section and check out the very first reference.) Indeed, why should we select a single kernel size when we can use them all? Instead of choosing one kernel size per layer, the block runs several convolutions in parallel (e.g. $1 \times 1$, $3 \times 3$, $5 \times 5$) and concatenates their outputs along the channel dimension. The network can thus combine fine-grained ($1 \times 1$), local ($3 \times 3$), and broader ($5 \times 5$) patterns in one place, and the relative importance of each path is learned.
A naive version, however, would be very expensive. The specific model of the Inception paper, called GoogLeNet (a play on words: it was developed by Google researchers and pays homage to the LeNet architecture), therefore, inspired by NiN, uses $1 \times 1$ convolutions as bottlenecks before the larger kernels: they reduce the channel count, so the subsequent $3 \times 3$ and $5 \times 5$ layers do far fewer operations while still capturing multi-scale structure.
The full Inception module typically has four branches (e.g. $1 \times 1$ only; $1 \times 1$ then $3 \times 3$; $1 \times 1$ then $5 \times 5$; $3 \times 3$ max-pool then $1 \times 1$), with the $1 \times 1$ layers keeping cost down. GoogLeNet also uses global average pooling and auxiliary classifiers (small heads with their own loss function attached to intermediate layers during training) to improve gradient flow and act as regularizers. The auxiliary classifiers are discarded at inference time, so they do not affect the final model cost.
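A minimal sketch of the four-branch structure follows; the channel counts passed to the block are illustrative, not the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionBlock(nn.Module):
    """Sketch of a GoogLeNet-style Inception block with four parallel branches."""
    def __init__(self, in_c, c1, c2, c3, c4):
        super().__init__()
        self.b1 = nn.Conv2d(in_c, c1, kernel_size=1)              # 1x1 only
        self.b2_1 = nn.Conv2d(in_c, c2[0], kernel_size=1)         # 1x1 bottleneck
        self.b2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        self.b3_1 = nn.Conv2d(in_c, c3[0], kernel_size=1)         # 1x1 bottleneck
        self.b3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        self.b4 = nn.Conv2d(in_c, c4, kernel_size=1)              # after max-pool

    def forward(self, x):
        y1 = F.relu(self.b1(x))
        y2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        y3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        y4 = F.relu(self.b4(F.max_pool2d(x, kernel_size=3, stride=1, padding=1)))
        return torch.cat([y1, y2, y3, y4], dim=1)  # concatenate along channels

block = InceptionBlock(192, 64, (96, 128), (16, 32), 32)
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28]); 64 + 128 + 32 + 32 = 256
```

Note how the paddings are chosen so that every branch preserves the spatial size, which is what makes the channel-wise concatenation possible.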
Note
In modern architectures auxiliary heads are generally no longer necessary. Later models improved gradient flow through techniques such as residual connections (e.g. Inception-ResNet), reducing the need for deep supervision.
We saw that increasing the number of layers in neural networks to learn more advanced functions is challenging due to issues like vanishing gradients. VGGNet partially addressed this problem by using repetitive blocks that stack multiple convolutional layers before downsampling with max-pooling. For instance, two consecutive $3 \times 3$ convolutional layers achieve the same receptive field as a single $5 \times 5$ convolution while preserving a higher spatial resolution for the intermediate layer. In simpler terms, repeating a smaller kernel allows the network to access the same input pixels while retaining more detail for subsequent processing. Larger kernels blur (downsample) the image more aggressively, which can lead to the loss of important details and force the network to reduce resolution earlier in the architecture.
We also discussed how VGGNet was still limited and showed diminishing returns beyond 19 layers. Inception architecture significantly reduced parameter count and leveraged the advantages of the $1 \times 1$ convolution kernel. Despite enabling deeper networks with far fewer parameters, Inception did not fully resolve the core training and convergence problems faced by very deep models.
As a consequence, Batch Normalization (Ioffe and Szegedy, 2015) and Residual Networks (He et al., 2015) emerged as two major solutions for efficiently training neural networks as deep as 100 layers and more. We will now set up the data environment and go on discussing the core ideas and implementations of both papers. We had already introduced the CIFAR dataset in our previous notebook.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
DATA_PATH = './data'
BATCH_SIZE = 32
cifar_mean = (0.4914, 0.4822, 0.4465)
cifar_std = (0.2470, 0.2435, 0.2616)
data_transforms = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(cifar_mean, cifar_std),
])
train_data = datasets.CIFAR10(root=DATA_PATH, train=True, download=True, transform=data_transforms)
test_data = datasets.CIFAR10(root=DATA_PATH, train=False, download=True, transform=data_transforms)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, pin_memory=True, num_workers=2)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False, pin_memory=True, num_workers=2)
Note
From machine learning, we know that it is good practice to split the data into training, validation (also called dev), and test sets. When the dataset is not large, an 80:10:10 split is a reasonable allocation. For larger datasets (e.g. with one million images), it is fine to allocate 95% or more of your data for training. The training set is used to update the model's parameters. The validation set is used for tuning hyperparameters (e.g. testing different learning rates, regularization strengths, etc.). The test split should ideally be used only once to report the final performance of the selected model (e.g. for inclusion in a research paper).
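Such a split can be produced with `torch.utils.data.random_split`. The sketch below uses a small synthetic dataset so it stays self-contained; with the CIFAR-10 training set loaded above, the same call would use lengths like `[45000, 5000]` for a 90:10 split.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for an image dataset: 1000 samples of shape (3, 32, 32).
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# Fixed seed so the split is reproducible across runs.
generator = torch.Generator().manual_seed(42)
train_subset, val_subset = random_split(dataset, [900, 100], generator=generator)
print(len(train_subset), len(val_subset))  # 900 100
```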
Batch Normalization¶
Batch normalization standardizes the hidden layer activations of a neural network during training. Instead of allowing the distribution of activations to vary freely from batch to batch, the layer normalizes them using statistics computed from the current mini-batch.
Note
The terms batch and mini-batch are often used interchangeably in deep learning, although they are not exactly the same. In the strict sense, a batch refers to the entire training dataset processed in a single update of the model parameters. A mini-batch refers to a smaller subset of the dataset processed together before computing the gradient and updating the model. In practice, most deep learning libraries use the word batch to mean mini-batch. For instance, the parameter batch_size in PyTorch specifies how many samples are processed together in one forward and backward pass, not the entire dataset.
For example, if a dataset contains 50,000 training images and the batch size is set to 128, the model will process 128 images at a time and update the parameters after each group. In this case, the algorithm performs many updates during one pass through the dataset (one epoch). This approach is called mini-batch gradient descent.
For each feature value $x_i$ in the mini-batch, we compute the batch mean $\mu_B$ and batch variance $\sigma_B^2$, and normalize the value by subtracting the mean and dividing by the standard deviation. A small constant $\epsilon$ is added for numerical stability so that the denominator never becomes zero:
$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
After this transformation, the normalized values $\hat{x}_i$ have a mean close to $0$ and a variance close to $1$ within that mini-batch. If this normalization were applied alone, it could restrict the representational flexibility of the network. To allow the model to learn the appropriate scale and offset of the activations, batch normalization introduces two learnable parameters: a scaling parameter $\gamma$ and a shifting parameter $\beta$.
$$ BN(x_i) = \gamma \hat{x}_i + \beta $$
These parameters are learned together with the rest of the model during training. If necessary, the network can recover the original distribution of activations by choosing appropriate values for $\gamma$ and $\beta$.
Note
When a mini-batch passes through the network, each neuron produces one value (feature) for every example. Batch normalization looks at these values together. The layer first computes the average value of that feature in the mini-batch and measures how much the values vary. It then shifts and rescales them so that they stay in a similar numerical range. This keeps the activations stable while the network is learning.
If the process stopped here, every feature would always remain normalized, which could make the network too restrictive. Neural networks need the ability to adjust how strongly signals pass through a layer. For this reason batch normalization adds two learnable parameters. One allows the network to stretch the values and the other allows it to shift them. During training the model learns how much normalization is useful and how much it should modify it.
Batch normalization is typically placed after the affine transformation of a layer and before the non-linear activation function. In other words, the linear mapping is applied first, the result is normalized, and only then the activation function is evaluated:
$$ z = \phi(\textrm{BN}(x)). $$
Important
Note that the bias term $b$ is often omitted when batch normalization is used. The reason is that the shifting role of the bias is already provided by the learnable parameter $\beta$ of the batch normalization. In practice, many implementations therefore disable the bias parameter in layers with bias = False.
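A minimal sketch of this convention: the convolution drops its bias because batch normalization subtracts the per-channel mean (cancelling any constant offset) and its learnable $\beta$ provides the shift instead. The channel counts are illustrative.

```python
import torch.nn as nn

# Conv -> BN -> activation, with the redundant conv bias disabled.
layer = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(128),  # beta plays the role of the bias
    nn.ReLU(),
)
print(layer[0].bias)  # None
```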
Training very deep neural networks is difficult because the scale of activations can change significantly from layer to layer during learning. As parameters are updated, the distribution of intermediate activations also shifts. Each layer must continuously adapt to these changes, which slows down optimization and can make training unstable.
We already saw parameter initialization in our previous notebook. Methods such as He initialization choose weight variances so that signals neither explode nor vanish as they propagate through the network. While these techniques help at the start of training, the distributions of activations can still drift as learning progresses. Batch normalization stabilizes these intermediate activations during training.
Keeping activations within a predictable numerical range makes gradient-based optimization more reliable. Normalization reduces the risk of exploding or vanishing gradients, allowing deeper networks to train effectively. Because the scale of inputs to each layer is controlled, larger learning rates can often be used. Since the statistics are computed from a random mini-batch, a small amount of noise is introduced into the activations, which can improve generalization.
Note
The original paper argued that the method improves training by reducing internal covariate shift. This term refers to the phenomenon where the distribution of activations inside a network changes as the parameters of earlier layers are updated. If the input distribution to a layer keeps shifting, the layer must constantly adapt, which can slow down learning.
Later research has suggested that the primary benefit of batch normalization may not be the reduction of this distribution shift itself. Instead, many studies indicate that normalization improves the geometry of the optimization problem, producing smoother loss surfaces and better-conditioned gradients. This makes gradient-based optimization more stable and allows larger learning rates. As a result, the exact mechanism behind the success of batch normalization is still discussed in the literature, although its practical effectiveness is well established.
We will implement a batch normalization function and compare it with the torch.nn.BatchNorm2d module of PyTorch.
import torch
def BatchNorm2d(X, gamma=None, beta=None, eps=1e-5):
"""
Batch normalization for input (N, C, H, W).
Statistics are computed per-channel over (N, H, W).
Variance uses the biased (population) estimate (divide by N, not N-1).
"""
mean = X.mean(dim=(0, 2, 3), keepdim=True)
var = X.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
X_hat = (X - mean) / torch.sqrt(var + eps)
if gamma is not None and beta is not None:
X_hat = gamma * X_hat + beta
return X_hat
from torchvision import models
model = models.vgg11()
model.features[:6]
Sequential( (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): ReLU(inplace=True) (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (4): ReLU(inplace=True) (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) )
Important
The terms activation, feature, and feature map are often used interchangeably. An activation is the numerical output of a layer, while a feature refers to the same values interpreted as learned representations useful for a task. For example, model.features returns intermediate layer outputs that can be described both as activations and feature maps.
X, _ = next(iter(train_loader))
# obtaining the layer activations
with torch.no_grad():
A = model.features[:3](X)
A.shape
torch.Size([32, 64, 16, 16])
gamma = torch.ones(1, A.shape[1], 1, 1)
beta = torch.zeros(1, A.shape[1], 1, 1)
BN = BatchNorm2d(A, gamma, beta)
BN.shape
torch.Size([32, 64, 16, 16])
f"Input stats: {A.mean((0,2,3))[:3]} {A.var((0,2,3), unbiased=False)[:3]}"
'Input stats: tensor([0.1716, 0.1092, 0.0713]) tensor([0.0418, 0.0192, 0.0072])'
f"Input stats after BN: {BN.mean((0,2,3))[:3]}, {BN.var((0,2,3), unbiased=False)[:3]}"
'Input stats after BN: tensor([ 3.9581e-08, -3.5390e-08, 7.4506e-09]), tensor([0.9998, 0.9995, 0.9986])'
import torch.nn as nn
BN_torch = nn.BatchNorm2d(A.shape[1], affine=False)(A)
f"Input stats after PyTorch BN: {BN_torch.mean((0,2,3))[:3]}, {BN_torch.var((0,2,3), unbiased=False)[:3]}"
'Input stats after PyTorch BN: tensor([ 4.4703e-08, -1.8626e-09, -7.4506e-09]), tensor([0.9998, 0.9995, 0.9986])'
Running Statistics¶
During training, batch normalization computes the mean and variance from the current mini-batch. These statistics are then used to normalize the activations. However, during inference the model may process a single example or a batch that does not represent the full training distribution. If normalization relied only on the current batch, the output of the network could change depending on which samples appear together.
For this reason, batch normalization layers maintain running estimates of the mean and variance observed during training. These estimates approximate the statistics of the full dataset and are used during evaluation. When a model is switched to evaluation mode (e.g. with model.eval() in PyTorch), the stored running statistics are used instead of the statistics of the current batch. This ensures stable and deterministic predictions.
PyTorch implementations maintain running statistics automatically (when track_running_stats=True). The global mean and variance values are updated during training using an exponential moving average:
$$ \mu_{\text{running}} = (1 - \alpha)\,\mu_{\text{running}} + \alpha\,\mu_{\text{batch}} $$
$$ \sigma^2_{\text{running}} = (1 - \alpha)\,\sigma^2_{\text{running}} + \alpha\,\sigma^2_{\text{batch}}. $$
Note
An Exponential Moving Average (EMA) is a method for smoothing a sequence of values over time. Instead of keeping all past observations, EMA maintains a running estimate that is updated whenever a new value is observed. If $x_t$ is the new value at step $t$ and $m_{t-1}$ is the previous estimate, the updated value is $$ m_t = (1 - \alpha)m_{t-1} + \alpha x_t $$ where $0 < \alpha \le 1$ controls how quickly the estimate reacts to new data. A larger $\alpha$ makes the average respond more strongly to recent values, while a smaller $\alpha$ produces a smoother estimate that changes more slowly. Because older values are repeatedly multiplied by $1-\alpha$ , their influence decays exponentially over time, which is why the method is called an exponential moving average.
In PyTorch the parameter controlling this update is also called momentum. Despite the name, it is unrelated to the momentum used in optimization algorithms. Instead, it determines how quickly the running statistics adapt to new batches. A larger value updates the statistics more aggressively using recent batches. A smaller value averages information over a longer history of batches. The default value in PyTorch is 0.1. During evaluation, batch normalization uses these stored statistics instead of recomputing them from the input batch.
def BatchNorm2d(X, running_mean, running_var, training=True, momentum=0.1, eps=1e-5):
if training:
mean = X.mean((0,2,3))
var = X.var((0,2,3), unbiased=False)
running_mean = (1 - momentum) * running_mean + momentum * mean
running_var = (1 - momentum) * running_var + momentum * var
else:
mean = running_mean
var = running_var
X_hat = (X - mean[None,:,None,None]) / torch.sqrt(var[None,:,None,None] + eps)
return X_hat, running_mean, running_var
running_mean = torch.zeros(X.shape[1])
running_var = torch.ones(X.shape[1])
for step, (X, _) in enumerate(train_loader):
batch_mean = X.mean((0,2,3))
_, running_mean, running_var = BatchNorm2d(X, running_mean, running_var, training=True, momentum=0.1)
if (step+1) % 25 == 0 or step == 0:
print(f"Batch {step+1} mean: {batch_mean.detach().cpu().numpy().round(8)}")
print(f"Running {step+1} mean: {running_mean.detach().cpu().numpy().round(8)}\n")
if step+1 == 100:
break
Batch 1 mean: [-0.0209811 -0.10714138 -0.16730848] Running 1 mean: [-0.00209811 -0.01071414 -0.01673085] Batch 25 mean: [-0.01610584 -0.03596737 -0.06363342] Running 25 mean: [ 0.00858956 -0.0010609 -0.03079914] Batch 50 mean: [-0.12941505 -0.0811762 0.01222364] Running 50 mean: [-0.00949571 0.00525985 0.01912233] Batch 75 mean: [0.127812 0.1735982 0.09070341] Running 75 mean: [0.01514888 0.02700433 0.03230472] Batch 100 mean: [-0.03082126 -0.04937264 -0.13929878] Running 100 mean: [ 0.01407202 0.01071173 -0.00154937]
Note
The running mean converges to the dataset mean because its update is an exponential moving average: $$m_{t+1}=(1-\alpha)m_t+\alpha\,b_t,$$ where $m_t$ is the running mean, $b_t$ is the batch mean at step $t$, and $\alpha$ is the momentum. If batches are sampled from a fixed data distribution with mean $\mu$, then $\mathbb{E}[b_t]=\mu$. Taking expectation of the update gives $$\mathbb{E}[m_{t+1}]=(1-\alpha)\mathbb{E}[m_t]+\alpha\mu.$$ Subtract $\mu$ from both sides to track the error relative to the true mean: $$\mathbb{E}[m_{t+1}]-\mu=(1-\alpha)(\mathbb{E}[m_t]-\mu),$$ so the error shrinks geometrically as $|1-\alpha| < 1$: $$\mathbb{E}[m_t] - \mu = (1-\alpha)^t (m_0 - \mu) \to 0$$ Hence, we get $\mathbb{E}[m_t]\to \mu$.
Important
A rule of thumb is that batch sizes between 50 and 100 generally work well for batch normalization: the batch is large enough to yield reliable statistics but not so large that it causes memory issues or slows down training unnecessarily. A batch size of 32 is usually the lower bound at which batch normalization still provides relatively stable estimates. A batch size of $128$ is also effective if the hardware allows, and can produce even smoother estimates. Beyond that the benefit often diminishes.
Layer Normalization¶
If the batch size is very small due to memory limitations, batch normalization may lose effectiveness because its statistics depend on the current mini-batch. In such situations it is often better to use Layer Normalization (Ba et al., 2016), which does not rely on the batch dimension.
Instead of computing statistics across the batch, layer normalization computes them across the features of each individual sample. For an input vector $x \in \mathbb{R}^d$, the mean and variance are
$$ \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2 $$
The normalized output is
$$ y_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \gamma_i + \beta_i $$
Because these statistics are computed within each sample, the behavior of the layer is the same during training and inference. Hence, there is no need to track the running statistics.
Example
Suppose a hidden layer produces the vector $ x = [2,\,4,\,6,\,8]. $ Layer normalization computes the mean and variance using only these four values, then rescales the vector so that it is centered and normalized. This happens independently for each sample.
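The arithmetic in this example can be checked with PyTorch's built-in nn.LayerNorm, with the affine parameters disabled so only the normalization is applied: the mean is $5$, the (biased) variance is $5$, so each value becomes $(x_i - 5)/\sqrt{5}$.

```python
import torch
import torch.nn as nn

x = torch.tensor([[2.0, 4.0, 6.0, 8.0]])        # one sample, four features
ln = nn.LayerNorm(4, elementwise_affine=False)  # normalization only, no gamma/beta
print(ln(x))  # tensor([[-1.3416, -0.4472,  0.4472,  1.3416]])
```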
Batch normalization and layer normalization stabilize neural network training, but they work in different ways and are suited to different architectures.
Batch normalization computes statistics across the mini-batch (and across spatial dimensions in CNNs). Because it uses many values to estimate the mean and variance, the estimates are usually stable and it often improves optimization speed and generalization in convolutional networks. However, its behavior depends on the batch size and it requires running statistics during inference.
Layer normalization computes statistics using only the features of a single sample, so it behaves the same during training and inference and does not depend on the batch size. This makes it particularly effective for transformers, sequence models, and settings where batches are small or variable.
In CNNs, batch normalization benefits from the structure of feature maps. A convolutional layer produces an activation tensor of shape $N \times C \times H \times W$, and statistics for each channel are computed using all samples and all spatial positions:
$$ \mu_c = \frac{1}{N H W} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{n,c,h,w} $$
Because every channel contains many spatial values, the normalization uses a large number of observations. This makes the estimated mean and variance stable and often improves optimization in CNNs even with small batch sizes.
Exercise
Implement layer normalization.
Residual Network¶
A residual neural network (ResNet) consists of repeated residual blocks, each built around a skip (shortcut) connection.
The idea of the residual connection is simple. The input first passes through two weight layers with an activation in between, producing a transformation $g(x)$. Before applying the final activation, the original input $x$ is added to this transformation: $f(x) = g(x) + x$. The result is then passed through the activation function. This skip connection allows the network to preserve the original signal and makes very deep networks easier to train. In the code below, we will create ResidualBlock by also integrating batch normalization.
import torch.nn.functional as F
class ResidualBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn1 = nn.BatchNorm2d(channels)
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn2 = nn.BatchNorm2d(channels)
def forward(self, x):
out = self.conv1(x)
out = self.bn1(out)
out = F.relu(out)
out = self.conv2(out)
out = self.bn2(out)
return F.relu(out + x)
ResNet was originally developed for image classification, winning the ILSVRC 2015 competition. Inspired by VGGNet, each of its residual blocks consists of two $3 \times 3$ convolutions, both integrating batch normalization, followed by a skip connection. Today, transformers incorporate residual connections heavily in their design.
Note
Shape mismatch is a common issue when implementing residual blocks. A skip connection performs elementwise addition, so both tensors must have the same shape $(N, C, H, W)$. In the example, no issues occur because the convolutions preserve the input shape. The number of channels stays the same, and the spatial dimensions are kept constant by using a $3 \times 3$ kernel with padding $1$ and stride $1$. Recall that the output size of a convolution (for height/width) is $$ H_{out} = \frac{H + 2p - k}{s} + 1 $$ With $k=3$, $p=1$, and $s=1$, the result is $H_{out}=H$. This means the transformation $g(x)$ has the same dimensions as the input, allowing the skip connection to be added safely.
In deeper networks the shape may intentionally change, for example when increasing the number of channels or reducing spatial resolution. In those cases architectures align the shapes using a $1 \times 1$ convolution on the skip path before performing the addition.
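As a sketch of this pattern (the class name and sizes here are illustrative, not taken from a specific architecture), a block that doubles the channels and halves the spatial resolution can align the skip path with a strided $1 \times 1$ convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleResidualBlock(nn.Module):
    """Illustrative residual block that doubles channels and halves the
    spatial size. A 1x1 convolution with stride 2 on the skip path
    matches both shape changes so the addition is valid."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 projection aligns channels and resolution on the skip path
        self.proj = nn.Conv2d(in_channels, out_channels, 1, stride=2)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.proj(x))

x = torch.randn(2, 64, 32, 32)
block = DownsampleResidualBlock(64, 128)
print(block(x).shape)  # torch.Size([2, 128, 16, 16])
```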
As with batch normalization, the advantages of residual connections become apparent at around 50 layers or more of repeated residual blocks. But why does adding the layer's input to the output of the second affine transformation help training? We provide a slightly oversimplified intuition below.
Let's take any deep learning model. The types of functions this model can learn depend on its design (e.g. number of layers, activation functions). Denote this set of learnable functions as the class $\mathcal{F}$. If we cannot learn a perfect function for our data, which is usually the case, we can at least try to approximate it as closely as possible by minimizing a loss. We might assume that a more powerful model can learn more types of functions and therefore performs better, but that is not always the case. To guarantee at least the performance of a simpler model, the more powerful model must be capable of learning every function the simpler model can learn: its function class $\mathcal{F}'$ should satisfy $\mathcal{F} \subseteq \mathcal{F}'$. If $\mathcal{F}'$ is not such an expanded version of $\mathcal{F}$, the new model might actually learn a function that is farther from the truth, and even perform worse.
Refer to the figure above, where the residual output is $f(x) = g(x) + x$. If some activation nodes in our network are unnecessary, add complexity, or learn bad representations, the residual block can fall back to the identity function $f(x) = x$ simply by driving those nodes' parameters to zero, instead of learning weights and biases. As a result, inputs propagate faster while the learned functions remain within the larger function class. By providing an identity path through which gradients can flow directly across layers, skip connections also alleviate the vanishing gradient problem.
Residual blocks superficially resemble dropout, but the two differ: dropout blocks a path, while a residual connection lets the input "jump over" (skip) the intermediate nodes, enlarging the set of functions the network can learn. Crucially, the function class of a model with residual blocks is a superset of that of the same model without them. Hence, residual connections allow the model to represent complex functions while retaining easy access to simpler ones through the identity mapping, which improves optimization and mitigates vanishing gradients.
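The identity-function argument can be checked directly. In the sketch below (reusing the block structure defined earlier), zeroing the second convolution makes the residual branch output $g(x) = 0$, so the block reduces to $f(x) = \mathrm{ReLU}(x)$, which is the identity for non-negative inputs such as activations coming out of a previous ReLU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Same structure as the block defined earlier in this notebook
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)

block = ResidualBlock(8).eval()
# Zeroing the second convolution kills the residual branch: g(x) = 0
with torch.no_grad():
    block.conv2.weight.zero_()
    block.conv2.bias.zero_()

x = torch.rand(2, 8, 5, 5)          # non-negative input, as after a ReLU
print(torch.allclose(block(x), x))  # True: the block computes the identity
```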
ResNeXt¶
ResNeXt (Xie et al., 2017) builds directly on ResNet. Instead of only increasing the depth (number of layers) or width (number of channels), ResNeXt increases the number of parallel transformations applied to the same input. A ResNeXt block can be written as
$$ y = x + \sum_{i=1}^{g} T_i(x), $$
where $x$ is the input, $T_i$ are parallel transformations, and $g$ is the cardinality (the number of groups). The outputs of these transformations are aggregated and then added to the input through the residual connection.
These parallel paths are implemented using grouped convolutions. A grouped convolution splits the input channels into $g$ groups and applies a convolution independently to each group. This efficiently approximates multiple parallel transformations while keeping the block structure simple.
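A minimal illustration of the parameter savings, using the `groups` argument of `nn.Conv2d` (channel sizes chosen for illustration):

```python
import torch.nn as nn

# A standard 3x3 convolution vs. a grouped one with cardinality g = 32
dense = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False, groups=32)

# Each of the 32 groups convolves 128/32 = 4 input channels to 4 output
# channels, so the parameter count drops by a factor of g
print(sum(p.numel() for p in dense.parameters()))    # 147456 = 128*128*3*3
print(sum(p.numel() for p in grouped.parameters()))  # 4608   = 147456 / 32
```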
Note
The authors showed that increasing cardinality can be more effective than simply increasing network depth or width.
DenseNet¶
DenseNet (Huang et al., 2017) aimed to improve upon ResNet. The main difference is that ResNet adds previous features through summation, while DenseNet concatenates them. Summation keeps the number of channels fixed and is memory efficient, but concatenation preserves all earlier representations explicitly. In practice, DenseNet often achieves strong accuracy with fewer parameters, though it may be slower or heavier in memory due to storing many feature maps.
If a block has layers $x_0, x_1, x_2, \dots$, then each new layer receives as input the concatenation of all previous feature maps: $$ x_\ell = H_\ell([x_0, x_1, \dots, x_{\ell-1}]), $$ where $[\cdot]$ denotes concatenation along the channel dimension, and $H_\ell$ is typically a small sequence such as BatchNorm–ReLU–Conv. In a standard network, each layer passes its output only to the next one, so earlier features may gradually get overwritten or forgotten. In DenseNet, later layers can directly access low-level and mid-level features from all earlier layers. This improves gradient flow, encourages reuse of already computed representations, and reduces the need for each layer to relearn similar features.
Because outputs are concatenated rather than summed, the number of channels grows as the block gets deeper. DenseNet therefore introduces a few ideas to make computation manageable. For example, the notion of a growth rate $k$: each layer produces only a small number of new feature maps, for example 12, 24, or 32. If a block starts with $C_0$ channels and has $L$ layers, then the output has $$ C_0 + Lk $$ channels. So the network grows gradually instead of exploding immediately.
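A minimal dense block sketch (the class name and sizes are illustrative) that follows the growth-rate formula above:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Illustrative dense block: each layer sees the concatenation of all
    earlier feature maps and contributes k (growth rate) new channels."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            c = in_channels + i * growth_rate     # channels seen by layer i
            self.layers.append(nn.Sequential(     # BatchNorm-ReLU-Conv
                nn.BatchNorm2d(c),
                nn.ReLU(),
                nn.Conv2d(c, growth_rate, 3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

# C_0 = 16 channels, growth rate k = 12, L = 4 layers -> 16 + 4*12 = 64
block = DenseBlock(16, 12, 4)
x = torch.randn(2, 16, 8, 8)
print(block(x).shape)  # torch.Size([2, 64, 8, 8])
```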
SqueezeNet / MobileNet / EfficientNet¶
Several later CNN architectures focused primarily on improving efficiency rather than introducing fundamentally new architectural ideas.

SqueezeNet (Iandola et al., 2016) reduced model size dramatically using fire modules, where a $1 \times 1$ convolution first compresses the number of channels (the squeeze step) and parallel $1 \times 1$ and $3 \times 3$ convolutions then expand the representation. By relying heavily on $1 \times 1$ filters, SqueezeNet achieved AlexNet-level accuracy with far fewer parameters.

MobileNet (Howard et al., 2017) introduced depthwise separable convolutions, which split a standard convolution into a depthwise spatial convolution followed by a $1 \times 1$ pointwise convolution that mixes channels. This factorization greatly reduces parameters and computation and became widely used in lightweight models for mobile and edge devices.

EfficientNet (Tan and Le, 2019) addressed the problem of scaling networks. Instead of increasing depth, width, or input resolution independently, the authors proposed compound scaling, where all three are increased together in a balanced way. Starting from a baseline model built from mobile inverted bottleneck blocks, the architecture is scaled to produce a family of models (EfficientNet-B0 to B7) that trade off accuracy against computational cost, allowing practitioners to choose a model that matches their available compute while maintaining strong performance.
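The depthwise separable factorization used by MobileNet is easy to express with the `groups` argument of `nn.Conv2d`; the channel sizes below are illustrative:

```python
import torch.nn as nn

# Standard 3x3 convolution: every filter spans all input channels
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Depthwise separable factorization (MobileNet-style):
#   1) depthwise: one 3x3 filter per input channel (groups = in_channels)
#   2) pointwise: 1x1 convolution that mixes channels
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),
    nn.Conv2d(64, 128, kernel_size=1, bias=False),
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(standard))   # 73728 = 64*128*3*3
print(params(separable))  # 8768  = 64*3*3 + 64*128
```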
AnyNet¶
Most early CNN architectures were designed largely through manual experimentation and intuition. However, the space of possible network architectures is extremely large, making it difficult to identify optimal designs by hand. One approach to this problem is Neural Architecture Search (NAS), which attempts to automatically discover good network architectures using optimization methods such as reinforcement learning, evolutionary algorithms, or gradient-based search. NAS methods evaluate many candidate architectures and select the one that performs best on a validation task. While this approach can produce strong models (for example EfficientNet), it often requires enormous computational resources.
An alternative idea is to search not for a single best architecture but for a design space of good architectures. The AnyNet framework (Radosavovic et al., 2020) defines such a space of CNNs with a simple template.
Networks are divided into three main components: a stem, a body, and a head. The stem performs initial processing of the input image, the body contains the main sequence of convolutional blocks organized into stages with decreasing spatial resolution, and the head converts the final representation into class predictions, typically using global average pooling followed by a linear classifier. Within this template, several architectural parameters can be varied, such as the number of channels in each stage, the number of blocks per stage, and the number of groups used in grouped convolutions.
Transfer Learning and Fine-Tuning¶
Training a deep network from scratch requires a large amount of labeled data and compute. In many applications we have only a small or medium-sized dataset for our target task, but we have access to models that were already trained on a related, large-scale source task such as ImageNet classification. Transfer learning is the practice of reusing such a pretrained model for a new task. The underlying idea is that the early and middle layers of a CNN learn general-purpose visual features (edges, textures, shapes) that transfer well across tasks, while the final layers are more task-specific. Instead of learning everything from scratch, we keep most of the pretrained representation and adapt only the parts that must change for the new setting.
A typical transfer learning workflow has two phases. In the first phase we often freeze the pretrained backbone (all layers except the classifier) and train only a new classification head on our dataset. This is what many people usually mean by transfer learning: we transfer the feature extractor and learn a new mapping from features to our class labels.
In the second phase we may unfreeze some or all of the backbone and continue training with a small learning rate; this is fine-tuning. Fine-tuning allows the network to adjust the pretrained features to the new data, which can improve accuracy when the target task or domain differs from the source. But care is needed: too large a learning rate can overwrite the useful pretrained weights and hurt performance.
We will use a pretrained MobileNet (lightweight and fast to run) and the CIFAR-10 dataset we set up earlier. Recall that the pretrained model was trained for ImageNet (1000 classes), while CIFAR-10 has 10 classes, so we must replace the classifier head and train it (and optionally part of the backbone) on CIFAR-10.
from torchvision import models
backbone = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.IMAGENET1K_V1)
backbone.eval()
backbone.classifier
Sequential(
  (0): Linear(in_features=576, out_features=1024, bias=True)
  (1): Hardswish()
  (2): Dropout(p=0.2, inplace=True)
  (3): Linear(in_features=1024, out_features=1000, bias=True)
)
As the number of input features to the backbone head is fixed by the last channel dimension before global pooling, we replace the entire classifier with a single linear layer that maps from that feature dimension to 10 classes.
num_classes = 10
in_features = backbone.classifier[0].in_features
backbone.classifier = nn.Linear(in_features, num_classes)
# Sanity check: forward pass on a batch (CIFAR images are 32x32, backbone accepts any spatial size)
X, _ = next(iter(train_loader))
with torch.no_grad():
    out = backbone(X)
out.shape  # (batch_size, 10)
torch.Size([32, 10])
For the transfer learning phase we freeze the backbone so that only the new linear head is updated. This keeps the pretrained features fixed and avoids overfitting when the target dataset is small. We set requires_grad = False for all parameters except those in the classifier.
for p in backbone.parameters():
    p.requires_grad = False
for p in backbone.classifier.parameters():
    p.requires_grad = True
# Only the classifier parameters are trained
trainable = [p for p in backbone.parameters() if p.requires_grad]
len(trainable), sum(p.numel() for p in trainable)
(2, 5770)
We train for a few epochs using only the classifier parameters. The backbone acts as a fixed feature extractor; the loss and gradients affect only the new head.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = backbone.to(device)
optim = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
def eval_acc(loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(device), y.to(device)
            pred = model(X).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.size(0)
    return correct / total
epochs = 3
for e in range(epochs):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optim.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optim.step()
    print(f"Epoch {e+1}/{epochs} train acc: {eval_acc(train_loader):.4f} test acc: {eval_acc(test_loader):.4f}")
Epoch 1/3 train acc: 0.3318 test acc: 0.3284
Epoch 2/3 train acc: 0.3328 test acc: 0.3233
Epoch 3/3 train acc: 0.3375 test acc: 0.3275
Fine-tuning¶
After the head has adapted to the new task, we can unfreeze the backbone and continue training with a smaller learning rate. This allows the earlier layers to adapt slightly to the target data without destroying the pretrained features. It is common to use a learning rate that is an order of magnitude (or more) smaller for the backbone than for the classifier. Below we unfreeze all parameters and use a single small learning rate for the whole model.
# Unfreeze all parameters
for p in model.parameters():
    p.requires_grad = True
optim_ft = torch.optim.Adam(model.parameters(), lr=1e-5)
epochs = 10
for e in range(epochs):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optim_ft.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optim_ft.step()
    print(f"Fine-tune epoch {e+1}/{epochs} test acc: {eval_acc(test_loader):.4f}")
Fine-tune epoch 1/10 test acc: 0.3699
Fine-tune epoch 2/10 test acc: 0.4034
Fine-tune epoch 3/10 test acc: 0.4297
Fine-tune epoch 4/10 test acc: 0.4487
Fine-tune epoch 5/10 test acc: 0.4671
Fine-tune epoch 6/10 test acc: 0.4829
Fine-tune epoch 7/10 test acc: 0.4933
Fine-tune epoch 8/10 test acc: 0.5051
Fine-tune epoch 9/10 test acc: 0.5136
Fine-tune epoch 10/10 test acc: 0.5230
For very large models (e.g. large language models), full fine-tuning of all parameters is expensive and often unnecessary. Parameter-efficient methods such as LoRA (Low-Rank Adaptation) (Hu et al., 2021) inject small low-rank matrices into selected layers and train only those matrices instead of the full weights. This drastically reduces the number of trainable parameters and memory use while still adapting the model to the target task. QLoRA (Dettmers et al., 2023) combines quantization of the base model with LoRA, so that the pretrained weights are stored in lower precision and only the low-rank adapters are updated in higher precision. These techniques are widely used to adapt large pretrained models on consumer hardware.
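To make the LoRA idea concrete, here is a minimal illustrative sketch (not an official implementation; the class name and hyperparameters are made up) of a low-rank adapter wrapped around a frozen linear layer:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA adapter: y = W x + (alpha/r) * B A x,
    where the pretrained W is frozen and only A, B (rank r) are trained."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4096 trainable values, vs. 262656 in the full linear layer
```

Because $B$ is initialized to zero, the adapted layer initially computes exactly the pretrained mapping; training then updates only the small adapter matrices.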
Conclusion¶
In this notebook we reviewed the role of the ImageNet dataset and the evolution of convolutional architectures from AlexNet and VGG to Inception, and then to designs that address training and efficiency: batch normalization for stable activations, ResNet and skip connections for very deep networks, and later architectures such as DenseNet, MobileNet, and EfficientNet. We set up CIFAR-10 and implemented batch normalization and residual blocks, and we used a pretrained MobileNet for transfer learning and fine-tuning on CIFAR-10. The same ideas, such as transferring representations and freezing and unfreezing layers, apply beyond vision to other domains where large pretrained models are available.