VGGNet - A Deep Convolutional Neural Network for Image Recognition#

VGGNet, introduced by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group (VGG) at the University of Oxford, represents a significant milestone in the evolution of Convolutional Neural Networks (CNNs). At its core, VGGNet demonstrated that network depth is crucial for achieving superior performance in visual recognition tasks, a finding that would influence CNN design for years to come.

Tip

Historical Context: In 2014, when VGGNet won the localization task and finished runner-up in classification at the ILSVRC challenge, the common belief was that deeper networks would be too difficult to train due to vanishing gradients and computational constraints. VGGNet's success challenged this assumption and paved the way for even deeper architectures such as ResNet.

Architecture#

VGGNet employs a systematic stack of convolutional layers using exclusively 3×3 filters with stride 1 and padding 1, interspersed with 2×2 max pooling layers with stride 2. This uniformity makes the architecture conceptually simple.

Fig. 64 A schematic representation of VGG16 architecture showing the progression of spatial dimensions and feature channels through the network. The input is a 224×224×3 image, and the output is a 1000-dimensional vector for ImageNet classification. The image is taken from https://www.researchgate.net/profile/Max-Ferguson.#

VGG has multiple variants; the most popular is VGG16, which has 16 weight layers (13 convolutional and 3 fully connected). The full architecture of VGG16 is as follows:

  • Input: [224 x 224] normalized, 3-channel color image (with color whitening, see section 3.2.1 in AlexNet article)

  • Conv1_1: Convolutional layer - [3 x 3] kernel x 64 channels + ReLU

  • Conv1_2: Convolutional layer - [3 x 3] kernel x 64 channels + ReLU

  • P1: Pooling layer - Max pooling, [2 x 2] kernel, stride = 2

  • Conv2_1: Convolutional layer - [3 x 3] kernel x 128 channels + ReLU

  • Conv2_2: Convolutional layer - [3 x 3] kernel x 128 channels + ReLU

  • P2: Pooling layer - Max pooling, [2 x 2] kernel, stride = 2

  • Conv3_1: Convolutional layer - [3 x 3] kernel x 256 channels + ReLU

  • Conv3_2: Convolutional layer - [3 x 3] kernel x 256 channels + ReLU

  • Conv3_3: Convolutional layer - [3 x 3] kernel x 256 channels + ReLU

  • P3: Pooling layer - Max pooling, [2 x 2] kernel, stride = 2

  • Conv4_1: Convolutional layer - [3 x 3] kernel x 512 channels + ReLU

  • Conv4_2: Convolutional layer - [3 x 3] kernel x 512 channels + ReLU

  • Conv4_3: Convolutional layer - [3 x 3] kernel x 512 channels + ReLU

  • P4: Pooling layer - Max pooling, [2 x 2] kernel, stride = 2

  • Conv5_1: Convolutional layer - [3 x 3] kernel x 512 channels + ReLU

  • Conv5_2: Convolutional layer - [3 x 3] kernel x 512 channels + ReLU

  • Conv5_3: Convolutional layer - [3 x 3] kernel x 512 channels + ReLU

  • P5: Pooling layer - Max pooling, [2 x 2] kernel, stride = 2 (reducing the feature map to 7 x 7 x 512)

  • (During training only: Dropout)

  • FC14: Fully connected layer - (7 x 7 x 512 = 25,088, flattened) → 4096

  • (During training only: Dropout)

  • FC15: Fully connected layer - 4096 → 4096

  • FC16: Fully connected layer - 4096 → 1000

  • Output: 1000-dimensional probability distribution (one probability per ImageNet class), produced by the softmax function
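To make the layer-by-layer description concrete, here is a minimal PyTorch sketch of VGG16. This is an illustrative reimplementation rather than the original model; the helper names `conv_block` and `make_vgg16` are our own.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, num_convs):
    """num_convs 3x3 conv+ReLU layers, followed by 2x2 max pooling with stride 2."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return layers

def make_vgg16(num_classes=1000):
    # Five conv stages: channels double after pooling, up to a maximum of 512.
    features = nn.Sequential(
        *conv_block(3, 64, 2),     # Conv1_1, Conv1_2, P1 -> 112x112
        *conv_block(64, 128, 2),   # Conv2_1, Conv2_2, P2 -> 56x56
        *conv_block(128, 256, 3),  # Conv3_1..3_3,  P3    -> 28x28
        *conv_block(256, 512, 3),  # Conv4_1..4_3,  P4    -> 14x14
        *conv_block(512, 512, 3),  # Conv5_1..5_3,  P5    -> 7x7
    )
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Dropout(0.5), nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True),
        nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, num_classes),  # softmax is applied by the loss during training
    )
    return nn.Sequential(features, classifier)

model = make_vgg16()
x = torch.randn(1, 3, 224, 224)  # one normalized RGB image
print(model(x).shape)            # torch.Size([1, 1000])
```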

The network progressively increases the number of feature channels, doubling them after each pooling operation until they reach a maximum of 512:

\[ \text{channels} = \{64 \rightarrow 128 \rightarrow 256 \rightarrow 512 \rightarrow 512\} \]

The spatial dimensions of the feature maps decrease after each pooling layer, while the number of channels increases, creating a characteristic pyramid structure:

\[ \text{spatial dimensions} = \{224 \rightarrow 112 \rightarrow 56 \rightarrow 28 \rightarrow 14 \rightarrow 7\} \]

Despite its apparent simplicity, VGG16 contains approximately 138 million parameters, with the majority concentrated in the first fully connected layer (approximately 102 million parameters). This large parameter count highlights an interesting trade-off in the architecture: while the convolutional layers follow a clean and efficient design, the fully connected layers remain computationally expensive. This issue was later addressed by the global average pooling proposed in [1].
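As a sanity check on these numbers, the parameters can be counted with the reference implementation in torchvision (assuming torchvision is installed):

```python
from torchvision.models import vgg16

model = vgg16(weights=None)  # architecture only, randomly initialized

total = sum(p.numel() for p in model.parameters())
fc1 = model.classifier[0]    # first fully connected layer: 25088 -> 4096
fc1_params = sum(p.numel() for p in fc1.parameters())

print(f"total parameters: {total / 1e6:.1f}M")       # ~138.4M
print(f"first FC layer:   {fc1_params / 1e6:.1f}M")  # ~102.8M
```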

Key Design Principles#

The success of VGGNet stems from several key design principles that work in tandem to create a powerful yet conceptually simple architecture. These principles went on to become standard practice in CNN design. Let us examine each of these design choices and their theoretical foundations.

Parameter Reduction using Stacked 3x3 Kernels#

One of the most influential contributions is the demonstration that stacking multiple 3×3 convolution layers can effectively replace larger kernels while reducing the total number of parameters. This principle is based on a fundamental insight about receptive fields in CNNs.

Consider stacking two 3×3 convolution layers. Each value in the first layer's feature map summarizes a 3×3 region of the input. The second layer then applies another 3×3 convolution to this feature map, so each of its values summarizes a 5×5 region of the original input. In other words, the receptive field of the second layer (the region of the input it can "see") is 5×5.
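More generally, a stack of \(n\) 3×3 convolutions with stride 1 has a receptive field of

\[ (2n + 1) \times (2n + 1) \]

so, for example, three stacked 3×3 layers cover a 7×7 region of the input, matching a single 7×7 convolution.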

Now, let us compare two cases:

  1. A single 5×5 convolution layer with stride 1

  2. Two stacked 3×3 convolution layers with stride 1

Which one has fewer parameters? The second. Ignoring biases and considering a single input and output channel, a single 5×5 convolution layer has \(5 \times 5 = 25\) parameters, while two stacked 3×3 convolution layers have \(2 \times (3 \times 3) = 18\) parameters.

This 28% reduction in parameters comes with an additional benefit: an extra ReLU non-linearity between the two convolutions, which makes the network both deeper and more expressive.
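The same saving holds once channels are taken into account. Assuming \(C\) input channels and \(C\) output channels (and again ignoring biases):

\[ \underbrace{5 \times 5 \times C^2}_{\text{single } 5 \times 5} = 25C^2 \qquad \text{vs.} \qquad \underbrace{2 \times (3 \times 3) \times C^2}_{\text{two stacked } 3 \times 3} = 18C^2 \]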

Fig. 65 A schematic representation of the receptive field of two stacked 3x3 convolution layers. The receptive field of the first layer is 3x3, and the receptive field of the second layer is 5x5. The image is taken from https://medium.com/@rekalantar/receptive-fields-in-deep-convolutional-networks-43871d2ef2e9#

VGG-style Data Augmentation#

VGGNet proposed multi-scale data augmentation (Fig. 66). In AlexNet, data augmentation was performed by randomly cropping 224×224 inputs from images rescaled so that the shorter side was 256 pixels (left half of the figure below). In addition to this, VGGNet also randomly crops 224×224 inputs from images rescaled to a second scale, with the shorter side set to 384 pixels (right half of the figure below).

By training on crops drawn from images rescaled to two different sizes, VGGNet learned features that are more robust to variations in object scale, leading to improved classification accuracy.
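A minimal sketch of this two-scale augmentation using torchvision transforms follows; the scales 256 and 384 come from the paper, while the class name `TwoScaleCrop` is our own.

```python
import random
from torchvision import transforms

class TwoScaleCrop:
    """Rescale the shorter side to a randomly chosen scale, then take a random 224x224 crop."""

    def __init__(self, scales=(256, 384), crop_size=224):
        self.scales = scales
        self.crop = transforms.Compose([
            transforms.RandomCrop(crop_size),
            transforms.RandomHorizontalFlip(),
        ])

    def __call__(self, img):
        # transforms.Resize with an int rescales the shorter side, preserving aspect ratio
        scale = random.choice(self.scales)
        img = transforms.Resize(scale)(img)
        return self.crop(img)

# usage: augment = TwoScaleCrop(); tensor = transforms.ToTensor()(augment(pil_image))
```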

Fig. 66 Data augmentation for VGGNet. The image is taken from https://cvml-expertguide.net/#