05. Batch Normalization and Residual Blocks¶
Info
The following source was consulted in preparing this material: Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. Dive into Deep Learning. Cambridge University Press. Chapter 8: Modern Convolutional Neural Networks.
Important
The notebook is currently under revision.
Increasing the number of layers in neural networks for learning more advanced functions is challenging due to issues like vanishing gradients. VGGNet (Simonyan and Zisserman, 2015) partially addressed this problem by using repetitive blocks that stack multiple convolutional layers before downsampling with max-pooling. For instance, two consecutive $3 \times 3$ convolutional layers achieve the same receptive field as a single $5 \times 5$ convolution, while preserving a higher spatial resolution for the next layer. In simpler terms, repeating a smaller kernel allows the network to access the same input pixels while retaining more detail for subsequent processing. Larger kernels blur (downsample) the image more aggressively, which can lead to the loss of important details and force the network to reduce resolution too early in the architecture.
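The receptive-field argument also has a parameter-count payoff. As a minimal sketch (the channel count `C = 64` is an illustrative assumption, not one of VGGNet's exact configurations), we can compare two stacked $3 \times 3$ convolutions against a single $5 \times 5$ convolution with the same input and output channels:

```python
import torch
import torch.nn as nn

C = 64  # illustrative channel count (an assumption, not a specific VGG config)

# Two stacked 3x3 convolutions: same 5x5 receptive field, fewer parameters
stacked = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)
# A single 5x5 convolution covering the same receptive field
single = nn.Conv2d(C, C, kernel_size=5, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), count(single))  # 73856 102464
```

The stacked version needs $2(3 \cdot 3 \cdot C^2 + C)$ parameters versus $5 \cdot 5 \cdot C^2 + C$ for the single kernel, and it additionally inserts a non-linearity between the two layers.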
Despite this breakthrough, VGGNet was still limited and showed diminishing returns beyond 19 layers (hence the VGG19 architecture). Another architecture was introduced the same year in the paper on the Inception architecture (Szegedy et al., 2014). It was named Inception after the internet meme from the famous Inception movie; if you don't believe this, scroll down to the paper's references section and check out the very first reference. Its implementation, the GoogLeNet model (a play on words: GoogLeNet 1) was developed by Google researchers, and 2) pays homage to the LeNet architecture), significantly reduced the parameter count and leveraged the advantages of the $1 \times 1$ convolution kernel (see the Network in Network paper (Lin et al., 2014), which also introduced the global average pooling (GAP) layer). Despite enabling deeper networks with far fewer parameters, Inception did not fully resolve the core training and convergence problems faced by very deep models.
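To make the two building blocks mentioned above concrete, here is a small sketch (the channel counts and feature-map size are made up for illustration): a $1 \times 1$ convolution mixes channels at each spatial position, often to reduce channel count cheaply, and global average pooling collapses each channel to a single value.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 256, 7, 7)  # hypothetical feature map: batch 2, 256 channels

# 1x1 convolution: mixes channels per pixel, here reducing 256 -> 64 channels
reduce = nn.Conv2d(256, 64, kernel_size=1)
# Global average pooling: one averaged value per channel
gap = nn.AdaptiveAvgPool2d(1)

y = reduce(x)
print(y.shape)       # torch.Size([2, 64, 7, 7])
print(gap(y).shape)  # torch.Size([2, 64, 1, 1])
```

Note that the $1 \times 1$ kernel leaves the spatial resolution untouched; only the channel dimension changes.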
As a consequence, Batch Normalization (Ioffe and Szegedy, 2015) and Residual Networks (He et al., 2015) emerged as two major solutions for efficiently training neural networks as deep as 100 layers and more. We will now set up the data environment and then discuss the core ideas and implementations of both papers. We already introduced the CIFAR dataset in the previous notebook.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
DATA_PATH = './data'
BATCH_SIZE = 32
cifar_mean = (0.4914, 0.4822, 0.4465)
cifar_std = (0.2470, 0.2435, 0.2616)
train_tfms = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(cifar_mean, cifar_std),
])
test_tfms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(cifar_mean, cifar_std),
])
train_data = datasets.CIFAR10(root=DATA_PATH, train=True, download=True, transform=train_tfms)
test_data = datasets.CIFAR10(root=DATA_PATH, train=False, download=True, transform=test_tfms)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, pin_memory=True, num_workers=2)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False, pin_memory=True, num_workers=2)
Files already downloaded and verified
Files already downloaded and verified
Note
From machine learning, we know that it is good practice to split the data into training, validation (also called dev), and test sets. When the dataset is not large, an 80 : 10 : 10 split is a reasonable ratio for allocation. For larger datasets (e.g. with one million images), it is fine to allocate 95% or more of your data for training. The training set is used to update the model's parameters. The validation set is used for tuning hyperparameters (e.g. testing different learning rates, regularization strengths, etc.). The test split should ideally be used only once to report the final performance of the selected model (e.g. for inclusion in a research paper).
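One common way to carve a validation set out of an existing training split is `torch.utils.data.random_split`. A minimal sketch, using a synthetic stand-in dataset so it runs anywhere (the 90 : 10 ratio and seed are illustrative choices):

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Synthetic stand-in for an image dataset: 1,000 samples of shape (3, 32, 32)
data = torch.randn(1_000, 3, 32, 32)
labels = torch.randint(0, 10, (1_000,))
full_train = TensorDataset(data, labels)

# 90/10 split into train and validation subsets
n_val = len(full_train) // 10
n_train = len(full_train) - n_val
train_set, val_set = random_split(
    full_train, [n_train, n_val],
    generator=torch.Generator().manual_seed(42))  # fixed seed for reproducibility

val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
print(len(train_set), len(val_set))  # 900 100
```

The same call works on `datasets.CIFAR10`; the fixed generator ensures the split is reproducible across runs.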
Batch Normalization¶
Info
The following source was consulted in preparing this material: Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. Dive into Deep Learning. Cambridge University Press. Chapter 8.5: Batch Normalization
Batch normalization standardizes the hidden layer activations of a neural network during training. Instead of allowing the distribution of activations to vary freely from batch to batch, the layer normalizes them using statistics computed from the current mini-batch.
Note
The terms batch and mini-batch are often used interchangeably in deep learning, although they are not exactly the same. In the strict sense, a batch refers to the entire training dataset processed in a single update of the model parameters. A mini-batch refers to a smaller subset of the dataset processed together before computing the gradient and updating the model. In practice, most deep learning libraries use the word batch to mean mini-batch. For instance, the parameter batch_size in PyTorch specifies how many samples are processed together in one forward and backward pass, not the entire dataset.
For example, if a dataset contains 50,000 training images and the batch size is set to 128, the model will process 128 images at a time and update the parameters after each group. In this case, the algorithm performs many updates during one pass through the dataset (one epoch). This approach is called mini-batch gradient descent.
For each feature value $x_i$ in the mini-batch, we compute the batch mean $\mu_B$ and batch variance $\sigma_B^2$, and normalize the value by subtracting the mean and dividing by the standard deviation. A small constant $\epsilon$ is added for numerical stability so that the denominator never becomes zero:
$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
After this transformation, the normalized values $\hat{x}_i$ have a mean close to $0$ and a variance close to $1$ within that mini-batch. If this normalization were applied alone, it could restrict the representational flexibility of the network. To allow the model to learn the appropriate scale and offset of the activations, batch normalization introduces two learnable parameters: a scaling parameter $\gamma$ and a shifting parameter $\beta$.
$$ BN(x_i) = \gamma \hat{x}_i + \beta $$
These parameters are learned together with the rest of the model during training. If necessary, the network can recover the original distribution of activations by choosing appropriate values for $\gamma$ and $\beta$.
Note
When a mini-batch passes through the network, each neuron produces one value (feature) for every example. Batch normalization looks at these values together. The layer first computes the average value of that feature in the mini-batch and measures how much the values vary. It then shifts and rescales them so that they stay in a similar numerical range. This keeps the activations stable while the network is learning.
If the process stopped here, every feature would always remain normalized, which could make the network too restrictive. Neural networks need the ability to adjust how strongly signals pass through a layer. For this reason batch normalization adds two learnable parameters. One allows the network to stretch the values and the other allows it to shift them. During training the model learns how much normalization is useful and how much it should modify it.
Batch normalization is typically placed after the affine transformation of a layer and before the non-linear activation function. In other words, the linear mapping is applied first, the result is normalized, and only then the activation function is evaluated:
$$ z = \phi(\textrm{BN}(Wx + b)). $$
Important
Note that the bias term $b$ is often omitted when batch normalization is used. The reason is that the shifting role of the bias is already provided by the learnable parameter $\beta$ of the batch normalization. In practice, many implementations therefore disable the bias parameter in such layers with `bias=False`.
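A minimal sketch of this pattern (the channel counts are arbitrary). The second part demonstrates why the bias is redundant: in training mode, batch normalization removes any constant shift of its input, so a conv bias would be normalized away.

```python
import torch
import torch.nn as nn

# Conv -> BN -> ReLU: bias=False because BN's beta already provides the shift
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
print(block[0].bias)  # None: no bias parameter was allocated

# Why the bias is redundant: a constant shift is removed by the normalization
bn = nn.BatchNorm2d(16, affine=False)
y = torch.randn(8, 16, 4, 4)
print(torch.allclose(bn(y), bn(y + 5.0), atol=1e-5))  # True
```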
Training very deep neural networks is difficult because the scale of activations can change significantly from layer to layer during learning. As parameters are updated, the distribution of intermediate activations also shifts. Each layer must continuously adapt to these changes, which slows down optimization and can make training unstable.
We had already seen parameter initialization in our previous notebook. Methods such as He initialization choose weight variances so that signals neither explode nor vanish as they propagate through the network. While these techniques help at the start of training, the distributions of activations can still drift as learning progresses. Batch normalization stabilizes these intermediate activations during training.
Keeping activations within a predictable numerical range makes gradient-based optimization more reliable. Normalization reduces the risk of exploding or vanishing gradients, allowing deeper networks to train effectively. Because the scale of inputs to each layer is controlled, larger learning rates can often be used. Since the statistics are computed from a random mini-batch, a small amount of noise is introduced into the activations, which can improve generalization.
Note
The original paper argued that the method improves training by reducing internal covariate shift. This term refers to the phenomenon where the distribution of activations inside a network changes as the parameters of earlier layers are updated. If the input distribution to a layer keeps shifting, the layer must constantly adapt, which can slow down learning.
Later research has suggested that the primary benefit of batch normalization may not be the reduction of this distribution shift itself. Instead, many studies indicate that normalization improves the geometry of the optimization problem, producing smoother loss surfaces and better-conditioned gradients. This makes gradient-based optimization more stable and allows larger learning rates. As a result, the exact mechanism behind the success of batch normalization is still discussed in the literature, although its practical effectiveness is well established.
We will implement a batch normalization function and compare it with the torch.nn.BatchNorm2d module of PyTorch.
import torch
def BatchNorm2d(X, gamma=None, beta=None, eps=1e-5):
    """
    Batch normalization for input (N, C, H, W).
    Statistics are computed per-channel over (N, H, W).
    Variance uses the population estimate (divide by N, not N-1).
    """
    mean = X.mean(dim=(0, 2, 3), keepdim=True)
    var = X.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    X_hat = (X - mean) / torch.sqrt(var + eps)
    if gamma is not None and beta is not None:
        X_hat = gamma * X_hat + beta
    return X_hat
from torchvision import models
model = models.vgg11()
model.features[:6]
Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): ReLU(inplace=True)
  (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
Important
The terms activation, feature, and feature map are often used interchangeably. An activation is the numerical output of a layer, while a feature refers to the same values interpreted as learned representations useful for a task. For example, model.features returns intermediate layer outputs that can be described both as activations and feature maps.
X, _ = next(iter(train_loader))
# obtaining the layer activations
with torch.no_grad():
    A = model.features[:3](X)
A.shape
torch.Size([32, 64, 16, 16])
gamma = torch.ones(1, A.shape[1], 1, 1)
beta = torch.zeros(1, A.shape[1], 1, 1)
BN = BatchNorm2d(A, gamma, beta)
BN.shape
torch.Size([32, 64, 16, 16])
f"Input stats: {A.mean((0,2,3))[:3]} {A.var((0,2,3), unbiased=False)[:3]}"
'Input stats: tensor([0.2512, 0.0854, 0.1703]) tensor([0.1009, 0.0094, 0.0352])'
f"Input stats after BN: {BN.mean((0,2,3))[:3]}, {BN.var((0,2,3), unbiased=False)[:3]}"
'Input stats after BN: tensor([-6.1467e-08, 2.7940e-09, -9.1735e-08]), tensor([0.9999, 0.9989, 0.9997])'
import torch.nn as nn
BN_torch = nn.BatchNorm2d(A.shape[1], affine=False)(A)
f"Input stats after PyTorch BN: {BN_torch.mean((0,2,3))[:3]}, {BN_torch.var((0,2,3), unbiased=False)[:3]}"
'Input stats after PyTorch BN: tensor([-8.0094e-08, 2.7008e-08, -3.3062e-08]), tensor([0.9999, 0.9989, 0.9997])'
Running Statistics in Batch Normalization¶
During training, batch normalization computes the mean and variance from the current mini-batch. These statistics are then used to normalize the activations. However, during inference the model may process a single example or a batch that does not represent the full training distribution. If normalization relied only on the current batch, the output of the network could change depending on which samples appear together.
For this reason, batch normalization layers maintain running estimates of the mean and variance observed during training. These estimates approximate the statistics of the full dataset and are used during evaluation. When a model is switched to evaluation mode (e.g. with model.eval() in PyTorch), the stored running statistics are used instead of the statistics of the current batch. This ensures stable and deterministic predictions.
PyTorch implementations automatically maintain running statistics (with track_running_stats). The global mean and variance values are updated during training using an exponential moving average:
$$ \mu_{\text{running}} = (1 - \alpha)\,\mu_{\text{running}} + \alpha\,\mu_{\text{batch}} $$
$$ \sigma^2_{\text{running}} = (1 - \alpha)\,\sigma^2_{\text{running}} + \alpha\,\sigma^2_{\text{batch}}. $$
Note
An Exponential Moving Average (EMA) is a method for smoothing a sequence of values over time. Instead of keeping all past observations, EMA maintains a running estimate that is updated whenever a new value is observed. If $x_t$ is the new value at step $t$ and $m_{t-1}$ is the previous estimate, the updated value is $$ m_t = (1 - \alpha)m_{t-1} + \alpha x_t $$ where $0 < \alpha \le 1$ controls how quickly the estimate reacts to new data. A larger $\alpha$ makes the average respond more strongly to recent values, while a smaller $\alpha$ produces a smoother estimate that changes more slowly. Because older values are repeatedly multiplied by $1-\alpha$ , their influence decays exponentially over time, which is why the method is called an exponential moving average.
In PyTorch the parameter controlling this update is also called momentum. Despite the name, it is unrelated to the momentum used in optimization algorithms. Instead, it determines how quickly the running statistics adapt to new batches. A larger value updates the statistics more aggressively using recent batches. A smaller value averages information over a longer history of batches. The default value in PyTorch is 0.1. During evaluation, batch normalization uses these stored statistics instead of recomputing them from the input batch.
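To build intuition before the full implementation, here is a tiny numeric sketch of this update rule. The "true" mean of 2.0 and the noise level are made-up values for illustration; the running estimate settles near the true mean despite never seeing more than one noisy batch statistic at a time.

```python
import torch

torch.manual_seed(0)
momentum = 0.1       # PyTorch's default EMA momentum
running_mean = 0.0   # initial estimate, as in nn.BatchNorm2d

# Stream of noisy batch means drawn around a true mean of 2.0 (illustrative)
for _ in range(500):
    batch_mean = 2.0 + 0.1 * torch.randn(1).item()
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean

print(round(running_mean, 2))  # close to the true mean 2.0
```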
def BatchNorm2d(X, running_mean, running_var, training=True, momentum=0.1, eps=1e-5):
    if training:
        mean = X.mean((0,2,3))
        var = X.var((0,2,3), unbiased=False)
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mean = running_mean
        var = running_var
    X_hat = (X - mean[None,:,None,None]) / torch.sqrt(var[None,:,None,None] + eps)
    return X_hat, running_mean, running_var
running_mean = torch.zeros(X.shape[1])
running_var = torch.ones(X.shape[1])
for step, (X, _) in enumerate(train_loader):
    batch_mean = X.mean((0,2,3))
    _, running_mean, running_var = BatchNorm2d(X, running_mean, running_var, training=True, momentum=0.1)
    if (step+1) % 25 == 0 or step == 0:
        print(f"Batch {step+1} mean: {batch_mean.detach().cpu().numpy().round(8)}")
        print(f"Running {step+1} mean: {running_mean.detach().cpu().numpy().round(8)}\n")
    if step+1 == 100:
        break
Batch 1 mean: [-0.28676817 -0.24872124 -0.1678364 ]
Running 1 mean: [-0.02867682 -0.02487212 -0.01678364]

Batch 25 mean: [-0.21809351 -0.27822074 -0.26022816]
Running 25 mean: [-0.2427671  -0.2524865  -0.20356455]

Batch 50 mean: [-0.20983346 -0.15853268 -0.09818553]
Running 50 mean: [-0.26161012 -0.25976467 -0.21901353]

Batch 75 mean: [-0.18251376 -0.11873064 -0.10977714]
Running 75 mean: [-0.26663187 -0.2610623  -0.23082148]

Batch 100 mean: [-0.32655272 -0.2929147  -0.28516495]
Running 100 mean: [-0.26810375 -0.25653508 -0.23382366]
Note
The running mean converges to the dataset mean because its update is an exponential moving average: $$m_{t+1}=(1-\alpha)m_t+\alpha\,b_t,$$ where $m_t$ is the running mean, $b_t$ is the batch mean at step $t$, and $\alpha$ is the momentum. If batches are sampled from a fixed data distribution with mean $\mu$, then $\mathbb{E}[b_t]=\mu$. Taking expectation of the update gives $$\mathbb{E}[m_{t+1}]=(1-\alpha)\mathbb{E}[m_t]+\alpha\mu.$$ Subtract $\mu$ from both sides to track the error relative to the true mean: $$\mathbb{E}[m_{t+1}]-\mu=(1-\alpha)(\mathbb{E}[m_t]-\mu),$$ so the error shrinks geometrically as $|1-\alpha| < 1$: $$\mathbb{E}[m_t] - \mu = (1-\alpha)^t (m_0 - \mu) \to 0$$ Hence, we get $\mathbb{E}[m_t]\to \mu$.
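The geometric decay of the error can be checked numerically. In this sketch the batch mean is held fixed at $\mu$ (no sampling noise), so the iterated update should match the closed form $(1-\alpha)^t (m_0 - \mu)$ exactly, up to floating-point rounding:

```python
alpha = 0.1           # momentum
m, mu = 0.0, 2.0      # running estimate starts at 0; true mean is 2.0

# Deterministic updates with b_t = mu exactly: error shrinks as (1 - alpha)^t
for t in range(50):
    m = (1 - alpha) * m + alpha * mu

predicted_error = (1 - alpha) ** 50 * (0.0 - mu)
print(m - mu, predicted_error)  # both approximately -0.0103
```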
Important
A rule of thumb is that batch sizes between 50 and 100 generally work well for batch normalization: the batch is large enough to return reliable statistics but not so large that it causes memory issues or slows down training unnecessarily. A batch size of $32$ is usually the lower bound where batch normalization still provides relatively stable estimates. A batch size of $128$ is also effective if the hardware allows, and can produce even smoother estimates. Beyond that the benefit often diminishes.
Layer Normalization¶
If the batch size is very small due to memory limitations, batch normalization may lose effectiveness because its statistics depend on the current mini-batch. In such situations it is often better to use Layer Normalization (Ba et al., 2016), which does not rely on the batch dimension.
Instead of computing statistics across the batch, layer normalization computes them across the features of each individual sample. For an input vector $x \in \mathbb{R}^d$, the mean and variance are
$$ \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2 $$
The normalized output is
$$ y_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \gamma_i + \beta_i $$
Because these statistics are computed within each sample, the behavior of the layer is the same during training and inference. Hence, there is no need to track the running statistics.
Example
Suppose a hidden layer produces the vector $ x = [2,\,4,\,6,\,8]. $ Layer normalization computes the mean and variance using only these four values, then rescales the vector so that it is centered and normalized. This happens independently for each sample.
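The worked example above can be verified directly with PyTorch's `nn.LayerNorm` (here with `elementwise_affine=False` so that no $\gamma$ or $\beta$ is applied):

```python
import torch
import torch.nn as nn

x = torch.tensor([[2.0, 4.0, 6.0, 8.0]])  # one sample with four features

ln = nn.LayerNorm(4, elementwise_affine=False)  # pure normalization, no gamma/beta
y = ln(x)
print(y)  # roughly [-1.3416, -0.4472, 0.4472, 1.3416]

# Manual check: mean = 5, population variance = 5, computed within the sample
mean, var = x.mean(), x.var(unbiased=False)
print(mean.item(), var.item())  # 5.0 5.0
```

The statistics use only these four values, so a second sample in the batch would be normalized with its own mean and variance.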
Batch normalization and layer normalization stabilize neural network training, but they work in different ways and are suited to different architectures.
Batch normalization computes statistics across the mini-batch (and across spatial dimensions in CNNs). Because it uses many values to estimate the mean and variance, the estimates are usually stable and it often improves optimization speed and generalization in convolutional networks. However, its behavior depends on the batch size and it requires running statistics during inference.
Layer normalization computes statistics using only the features of a single sample, so it behaves the same during training and inference and does not depend on the batch size. This makes it particularly effective for transformers, sequence models, and settings where batches are small or variable.
In CNNs, batch normalization benefits from the structure of feature maps. A convolutional layer produces an activation tensor of shape $N \times C \times H \times W$, and the statistics for each channel are computed using all samples and all spatial positions:
$$ \mu_c = \frac{1}{N H W} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{n,c,h,w} $$
Because every channel contains many spatial values, the normalization uses a large number of observations. This makes the estimated mean and variance stable and often improves optimization in CNNs even with small batch sizes.
Exercise
Implement layer normalization.
Residual Network¶
A residual neural network (ResNet) consists of repeated residual blocks, each containing a residual skip (shortcut) connection.
The idea of the residual connection is simple. The input first passes through two weight layers with an activation in between, producing a transformation $g(x)$. Before applying the final activation, the original input $x$ is added to this transformation: $f(x) = g(x) + x$. The result is then passed through the activation function. This skip connection allows the network to preserve the original signal and makes very deep networks easier to train. In the code below, we will create ResidualBlock by also integrating batch normalization.
import torch.nn.functional as F
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        return F.relu(out + x)
ResNet was originally developed for image classification tasks, winning the ImageNet competition. Inspired by VGGNet, each of its residual blocks consists of two $3 \times 3$ convolutions, both integrating batch normalization, followed by a skip connection. Today, transformers incorporate residual connections heavily in their design.
Note
Shape mismatch is a common issue when implementing residual blocks. A skip connection performs elementwise addition, so both tensors must have the same shape $(N, C, H, W)$. In the example, no issues occur because the convolutions preserve the input shape. The number of channels stays the same, and the spatial dimensions are kept constant by using a $3 \times 3$ kernel with padding $1$ and stride $1$. Recall that the output size of a convolution (for height/width) is $$ H_{out} = \frac{H + 2p - k}{s} + 1 $$ With $k=3$, $p=1$, and $s=1$, the result is $H_{out}=H$. This means the transformation $g(x)$ has the same dimensions as the input, allowing the skip connection to be added safely.
In deeper networks the shape may intentionally change, for example when increasing the number of channels or reducing spatial resolution. In those cases architectures align the shapes using a $1 \times 1$ convolution on the skip path before performing the addition.
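A minimal sketch of this pattern (the class name and channel sizes are illustrative; exact details vary between ResNet variants): the main path uses a stride-2 convolution to halve the resolution and double the channels, while a $1 \times 1$ stride-2 convolution on the skip path projects the input to the matching shape.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleResidualBlock(nn.Module):
    """Residual block that halves spatial size and changes the channel count.
    Sketch of the common projection-shortcut pattern, not a specific ResNet."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 convolution on the skip path aligns channels and resolution
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.proj(x))

x = torch.randn(2, 64, 16, 16)
print(DownsampleResidualBlock(64, 128)(x).shape)  # torch.Size([2, 128, 8, 8])
```

Without the projection, `out + x` would fail: the main path produces $128 \times 8 \times 8$ while the untouched input is $64 \times 16 \times 16$.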
Similar to batch normalization, the advantages of residual connections become obvious at around 50 layers or more, with repeated residual blocks. But why does adding the layer's input to the output of the second affine transformation boost training? We provide a slightly oversimplified intuition below.
Let's take any deep learning model. The types of functions this model can learn depend on its design (e.g. number of layers, activation functions, etc.). We denote the set of all these possible functions as the class $\mathcal{F}$. If we cannot learn a perfect function for our data, which is usually the case, we can at least try to approximate this function as closely as possible by minimizing a loss. We may assume that a more powerful model can learn more types of functions and show better performance, but that is not always the case. To achieve better performance than a simpler model, our model must be capable of learning not only more functions but also all the functions the simpler model can learn. Simply put, the function class of the more powerful model should be a superset of the simpler model's function class: $\mathcal{F} \subseteq \mathcal{F}'$. If $\mathcal{F}'$ is not an expanded version of $\mathcal{F}$, the new model might actually learn a function that is farther from the truth, and even show worse performance.
Recall that our residual output is $f(x) = g(x) + x$. If some activation nodes in our network are unnecessary, increase complexity, or learn bad representations, then instead of learning weights and biases, our residual block can learn the identity function $f(x) = x$ by simply setting those nodes' parameters to zero. As a result, our inputs will propagate faster while ensuring that the learned functions stay within the larger function class. By providing an identity path through which gradients can flow directly across layers, skip connections alleviate the vanishing gradient problem.
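The "learn the identity by zeroing parameters" claim can be checked directly. In this sketch we rebuild the same block as above, zero every learnable parameter (so $g(x) = 0$), and switch to evaluation mode so batch normalization uses its default running statistics (mean $0$, variance $1$). For a non-negative input, $\textrm{ReLU}(0 + x) = x$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):  # same block as defined earlier in this notebook
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)

block = ResidualBlock(8)
# Zero all learnable parameters: g(x) becomes 0, so f(x) = relu(0 + x)
with torch.no_grad():
    for p in block.parameters():
        p.zero_()
block.eval()  # use running stats (mean 0, var 1) instead of batch stats

x = torch.rand(2, 8, 5, 5)  # non-negative input, so relu(x) == x
print(torch.allclose(block(x), x))  # True
```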
Residual blocks may also remind you of dropout. To differentiate the two, imagine that dropout blocks a path, while a residual connection lets inputs "jump over" (skip) the nodes, allowing the network to learn more functions. Crucially, the function class of the model with residual blocks is a superset of that of the same model without such blocks. Hence, residual connections allow the model to represent complex functions while retaining easy access to simpler ones through the identity mapping, which improves optimization and mitigates vanishing gradients.