ResNet (Residual Neural Networks)#
Why did simply adding more layers to CNNs (like VGGNet or InceptionNet) fail to yield the expected performance gains—and sometimes even degraded accuracy?
Residual Neural Networks (ResNet) fundamentally changed the landscape of deep CNN training by introducing residual connections (a.k.a. skip connections). By stacking a series of residual blocks, ResNet enabled training CNNs with dozens or even hundreds of layers without succumbing to the vanishing gradient problem. Today, ResNet is considered one of the most important innovations in the history of deep learning, influencing architectures like ResNeXt and even Transformers.
Note
ResNeXt is an improvement over ResNet proposed by the same research group. It widens the residual blocks via grouped convolutions, achieving higher performance without drastically increasing depth.
Introduction and Context#
ResNet was introduced in [1] to address a key challenge at the time: CNNs deeper than about 20 layers were difficult to optimize and often performed worse than shallower counterparts. Despite the success of VGGNet (16 or 19 layers) and InceptionNet, researchers still faced two major issues when pushing CNNs to 50 layers or more:
Degradation Problem: Simply stacking more layers often degraded accuracy, rather than improving it.
Long Training Times: Extremely deep CNNs took a long time to converge, especially if the network was prone to vanishing or exploding gradients.
The ResNet solution was surprisingly simple yet groundbreaking: add skip connections that carry the original inputs across a few layers unmodified, letting the network focus on modeling the residual.
ResNet in Detail#
Why Going Deeper Was Difficult#
Shouldn’t deeper networks always perform better because they have more parameters and expressive power?
In theory, deeper CNNs can capture richer, more complex patterns. However, two issues hindered progress:
Degradation Problem Even with techniques like batch normalization, adding more layers beyond ~20 caused training error to increase, not decrease. This phenomenon was not simply due to overfitting—rather, the deeper network failed to optimize properly.
Longer Training and Vanishing Gradients As more layers are added, gradients can vanish (or explode). Backprop had trouble sending meaningful error signals all the way to early layers, causing them to learn slowly or not at all.
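To see why, consider a plain (non-residual) network in which layer \(i\) computes \(\mathbf{x}_{i+1} = H_i(\mathbf{x}_i)\) and \(\mathcal{L}\) is the loss. The chain rule gives
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \prod_{i=l}^{L-1} \frac{\partial H_i(\mathbf{x}_i)}{\partial \mathbf{x}_i}, \]
a product of many Jacobians. If their norms tend to be below 1, the product shrinks roughly exponentially with depth (and explodes if they tend to be above 1), so the earliest layers receive vanishingly small or unstable updates.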
Key Proposal: Residual Learning with Skip Connections#
What if each stack of layers simply learned a correction (residual) to the identity mapping?
A residual block consists of two (or three) convolutions grouped together, plus a skip connection:
Residual Path: A few convolution layers (for example, two 3×3 conv layers) modeling a function \( F(\mathbf{x}) \).
Skip (Identity) Path: A direct path for \(\mathbf{x}\) to bypass the convolutions entirely.
At the end of the block, the skip path is added elementwise to the residual path:
\[ \mathbf{y} = F(\mathbf{x}) + \mathbf{x}. \]
In PyTorch, you can implement a basic residual block as follows (a 1×1 projection on the skip path handles the case where the stride or channel count changes):
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Residual path: two 3x3 convolutions, each followed by batch norm
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Skip path: project the input with a 1x1 conv when the spatial size
        # or channel count changes; otherwise pass it through unchanged.
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.downsample = nn.Identity()

    def forward(self, x):
        identity = self.downsample(x)  # skip (identity) path
        out = self.conv1(x)            # residual path
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out += identity                # elementwise addition of the two paths
        out = self.relu(out)
        return out
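A quick sanity check on random input (the batch size, channel counts, and spatial size below are arbitrary, chosen only for illustration):

block = BasicBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 128, 28, 28]): channels doubled, resolution halved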

Fig. 68 A basic 2-layer residual block (left) vs. a plain block without skip (right). The skip connection allows the input \(\mathbf{x}\) to directly add to the block’s output.#
By stacking many such blocks, the network effectively cascades small residual changes across layers. The key benefits are:
Easier Optimization Instead of learning a full mapping \(\mathbf{y} = G(\mathbf{x})\), the block learns only the difference \(G(\mathbf{x}) - \mathbf{x}\). This decomposition often proves easier to optimize.
Note
If the optimal mapping is close to identity (i.e., the layer isn’t very important), the network can easily “skip” it by learning \(F(\mathbf{x}) \approx 0\). If a more complex transformation is needed, the residual path can still learn it. This makes training more robust—the network doesn’t have to work as hard to preserve important information through deep layers.
Ensemble-Like Behavior When you chain \(N\) residual blocks, you effectively create numerous paths for gradient flow—some skip many layers, some pass through multiple convolutions. This variety of gradient routes can speed convergence and reduce the risk of vanishing gradients [2].
Fig. 69 The gradient flow in ResNet with skip connections.#
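In equation form, treating each block as \(\mathbf{x}_{i+1} = \mathbf{x}_i + F_i(\mathbf{x}_i)\) and writing \(\mathcal{L}\) for the loss, unrolling from block \(l\) to a later block \(L\) gives
\[ \mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} F_i(\mathbf{x}_i), \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \left( 1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} F_i(\mathbf{x}_i) \right). \]
The additive 1 means that part of the gradient reaches layer \(l\) directly through the skip connections, without being multiplied by any weight-layer Jacobians. This is exactly the term a plain network lacks.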
Deeper Without Degradation ResNet-50, -101, and -152 can be trained without suffering the performance drop typical of overly deep “plain” networks.
Bottleneck Blocks for Deep ResNet#
ResNet comes in several variants of different depths. For the deeper variants, a bottleneck design is used to keep the computational cost manageable.

Fig. 70 A bottleneck block of ResNet.#
This bottleneck block consists of three convolutions instead of two, where:
the first \(1 \times 1\) conv reduces the feature dimension.
the second \(3 \times 3\) conv operates on this reduced dimension.
the third \(1 \times 1\) conv restores the dimension.
This approach shrinks the intermediate feature map, saving computational cost while retaining overall representational capacity. It was inspired by InceptionNet’s “bottleneck” idea [3][4].
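As a minimal sketch mirroring the BasicBlock above (the expansion factor of 4 and the projection shortcut follow the standard ResNet-50 design; the class and argument names here are illustrative):

class Bottleneck(nn.Module):
    expansion = 4  # the final 1x1 conv expands mid_channels by this factor

    def __init__(self, in_channels, mid_channels, stride=1):
        super().__init__()
        out_channels = mid_channels * self.expansion
        # 1x1 conv reduces the channel dimension
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        # 3x3 conv operates on the reduced dimension
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        # 1x1 conv restores (expands) the channel dimension
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the output shape differs from the input
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.downsample = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.downsample(x))

The design pays off because the expensive 3×3 convolution runs on only a quarter of the block’s output channels, while the cheap 1×1 convolutions handle the reduce/expand bookkeeping.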
Tip
ResNet-50, ResNet-101, and ResNet-152 all use bottleneck blocks. While they have more layers, they remain computationally feasible and yield progressively better accuracy on ImageNet.
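The names reflect a simple count of weighted layers: ResNet-50, for example, stacks \(3 + 4 + 6 + 3 = 16\) bottleneck blocks of three convolutions each, i.e. \(16 \times 3 = 48\) convolutional layers, and adding the initial \(7 \times 7\) convolution and the final fully connected layer gives 50.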
ResNeXt: A ResNet Improvement#
What if we could widen the residual blocks without drastically increasing the overall parameter count?
ResNeXt is an evolution of ResNet that:
Splits the bottleneck conv pathway into multiple “cardinality” groups (e.g., 32 groups).
Aggregates those parallel paths (grouped convolutions) back into a single output.
By increasing cardinality (the number of parallel convolution groups) instead of simply adding more channels or layers, ResNeXt achieves better accuracy at moderate complexity. The approach draws on Inception’s multi-branch parallel convolutions but unifies them into a single grouped-convolution block.

Fig. 71 A basic block of ResNeXt, showing multiple grouped-conv “paths” that are aggregated.#
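The key ingredient is the groups argument of PyTorch’s nn.Conv2d. A small, self-contained illustration (the channel counts follow the ResNeXt-50 “32×4d” setting, i.e. 32 groups of width 4):

import torch
import torch.nn as nn

# 128 channels split into 32 independent groups of 4 channels each;
# each group gets its own 3x3 filters, and the group outputs are concatenated.
grouped_conv = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)
plain_conv = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 128, 28, 28)
print(grouped_conv(x).shape)                              # torch.Size([1, 128, 28, 28])
print(sum(p.numel() for p in grouped_conv.parameters()))  # 4608 weights
print(sum(p.numel() for p in plain_conv.parameters()))    # 147456 weights (32x more)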
Implementation of ResNet#
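As a rough sketch of how the pieces fit together, here is a small ResNet-18-style network built from the BasicBlock defined earlier (the stage widths and block counts follow the standard ResNet-18 layout; the class name and helper structure are illustrative, not the reference torchvision implementation):

import torch
import torch.nn as nn

class SmallResNet(nn.Module):
    def __init__(self, num_classes=10, blocks_per_stage=(2, 2, 2, 2)):
        super().__init__()
        # Stem: one 7x7 conv plus max pooling, reducing 224x224 inputs to 56x56
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Four stages of residual blocks; the first block of each later stage
        # halves the spatial resolution (stride=2) and doubles the channel count.
        blocks = []
        in_ch = 64
        for i, n_blocks in enumerate(blocks_per_stage):
            out_ch = 64 * 2 ** i
            for j in range(n_blocks):
                stride = 2 if (i > 0 and j == 0) else 1
                blocks.append(BasicBlock(in_ch, out_ch, stride=stride))
                in_ch = out_ch
        self.blocks = nn.Sequential(*blocks)
        # Global average pooling followed by a single linear classifier
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, x):
        x = self.stem(x)
        x = self.blocks(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = SmallResNet(num_classes=10)
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])

Swapping BasicBlock for the Bottleneck sketch above (and adjusting the channel bookkeeping for its expansion factor) is essentially what turns this outline into a ResNet-50-style model.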
Summary#
Residual Learning ResNet overcame the degradation problem by framing deeper CNNs as a series of residual blocks, each learning a function \( F(\mathbf{x}) \) that is added to \(\mathbf{x}\).
Scalability With skip connections, ResNet-50, -101, and -152 exhibit higher accuracy without the optimization collapse typical of deeper plain networks.
Bottleneck & Beyond For high-depth architectures, the bottleneck design \((1\times1 \to 3\times3 \to 1\times1)\) improves efficiency. ResNeXt further extends ResNet by widening these pathways via grouped convolutions.
Lasting Impact Residual connections are now ubiquitous—not just in CNNs but also in Transformers, large-scale language models, U-Nets, and many other architectures. They simplify optimization and significantly improve gradient flow in very deep models.
Note
ResNet’s simplicity made it a foundation for many follow-up architectures. Unlike designs with complex branching (e.g., Inception blocks), ResNet remains easy to implement, debug, and extend—an important factor behind its widespread adoption.
Suggested Exercises#
Implement a Basic (Non-Bottleneck) Residual Block
Create a two-convolution block with a skip connection.
Test it on random data to confirm dimensions match.
Train a Small ResNet
Implement ResNet-18 or ResNet-34 from scratch on a smaller dataset (e.g., CIFAR-10).
Observe the training curve and compare to a plain CNN of the same depth.
Experiment with Bottleneck Blocks
Convert your ResNet-34 to a bottleneck-based ResNet-50-like structure.
Check the parameter count and performance difference on CIFAR-10 or a subset of ImageNet.