Understanding GoogLeNet Model - CNN Architecture

GoogLeNet (Inception V1) is a convolutional neural network designed for efficient image classification. It uses the Inception module to process multiple filter sizes in parallel, improving feature extraction while keeping computation low.

Inception modules combine 1×1, 3×3, and 5×5 convolutions with pooling in parallel
1×1 convolutions and global average pooling reduce computation and parameter count
The design targets high accuracy with efficient use of compute and memory

Key Features of GoogLeNet

1. 1×1 Convolutions

GoogLeNet uses 1×1 convolutions mainly for dimensionality reduction, which reduces computation and the number of trainable parameters while preserving important features.

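Example comparison, for a 5×5 convolution that produces 48 output channels from a 14×14×480 input:

Without 1×1 convolution: (14×14×48) × (5×5×480) = 112.9M operations

With a 1×1 convolution that first reduces the input to 16 channels: (14×14×16) × (1×1×480) + (14×14×48) × (5×5×16) = 5.3M operations

This cuts computation by roughly 21× while largely preserving the layer's representational power.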
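The comparison above can be reproduced with a few lines of Python. This is a minimal sketch that counts multiplications only and assumes stride 1 with the spatial size preserved; the helper name `conv_ops` is made up for this illustration.

```python
def conv_ops(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1 convolution that keeps the spatial size."""
    return (height * width * out_channels) * (kernel * kernel * in_channels)

# Direct 5x5 convolution: 14x14x480 input -> 48 output channels.
direct = conv_ops(14, 14, 480, 48, kernel=5)

# Bottleneck: 1x1 convolution down to 16 channels, then the 5x5 convolution.
reduce_step = conv_ops(14, 14, 480, 16, kernel=1)
conv_step = conv_ops(14, 14, 16, 48, kernel=5)
reduced = reduce_step + conv_step

print(f"direct:  {direct / 1e6:.1f}M")
print(f"reduced: {reduced / 1e6:.1f}M")
print(f"ratio:   {direct / reduced:.1f}x")
```

Most of the saving comes from the 5×5 convolution seeing 16 input channels instead of 480; the extra 1×1 step costs comparatively little.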

2. Global Average Pooling

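Instead of stacking large fully connected layers at the end of the network, GoogLeNet uses Global Average Pooling, which averages each feature map into a single value. A single fully connected layer then maps the pooled features to class scores.

Eliminates a large number of parameters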
Reduces overfitting
Improves generalization and accuracy
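To make the averaging step concrete, here is a small pure-Python sketch. The toy feature maps and the helper name `global_average_pool` are illustrative assumptions; the weight comparison assumes a 7×7×1024 final feature volume (the size reported for GoogLeNet) and 1000 output classes, and ignores biases.

```python
def global_average_pool(feature_maps):
    """Average each 2-D feature map (a list of rows) down to one value."""
    pooled = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = sum(len(row) for row in fmap)
        pooled.append(total / count)
    return pooled

# Two toy 2x2 feature maps (illustrative values only).
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [8.0, 0.0]],
]
pooled = global_average_pool(maps)  # one value per channel

# Weights needed to map a 7x7x1024 feature volume to 1000 classes.
flatten_fc_weights = 7 * 7 * 1024 * 1000  # dense layer on the flattened volume
gap_fc_weights = 1024 * 1000              # dense layer after global average pooling

print(pooled)
print(flatten_fc_weights // gap_fc_weights)
```

Because pooling collapses the 7×7 spatial grid before the classifier, the dense layer needs 49 times fewer weights under these assumptions.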

3. Inception Module

The Inception module is the core building block of GoogLeNet. It applies multiple operations in parallel:

1×1 convolutions
3×3 convolutions
5×5 convolutions
3×3 max pooling

All outputs are concatenated to capture multi-scale features efficiently without increasing computation significantly.
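The concatenation step can be sketched as simple shape bookkeeping. The channel counts below are those listed for the Inception (3a) module in the original GoogLeNet paper, not values given in this article, and the `InceptionConfig` class is a hypothetical helper. Each branch is assumed to preserve the spatial size, so only the channel dimension grows.

```python
from dataclasses import dataclass

@dataclass
class InceptionConfig:
    in_channels: int
    c1x1: int          # 1x1 branch
    c3x3_reduce: int   # 1x1 reduction before the 3x3 convolution
    c3x3: int          # 3x3 branch output
    c5x5_reduce: int   # 1x1 reduction before the 5x5 convolution
    c5x5: int          # 5x5 branch output
    pool_proj: int     # 1x1 projection after 3x3 max pooling

    def branch_channels(self):
        """Output channels of the four parallel branches, in order."""
        return [self.c1x1, self.c3x3, self.c5x5, self.pool_proj]

    def out_channels(self):
        """Channels after concatenating all branches along the depth axis."""
        return sum(self.branch_channels())

# Inception (3a): 28x28x192 input.
inception_3a = InceptionConfig(192, 64, 96, 128, 16, 32, 32)
height = width = 28
output_shape = (height, width, inception_3a.out_channels())
print(output_shape)
```

Note that the reduction widths (96 and 16) never appear in the output; they only limit how many channels the expensive 3×3 and 5×5 convolutions have to process.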

4. Auxiliary Classifiers

To reduce vanishing gradient problems, GoogLeNet uses auxiliary classifiers during training.

Each classifier includes:

Average pooling
1×1 convolution
Fully connected layers
Softmax output

These help stabilize training and improve generalization.
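One way to see how the auxiliary classifiers enter training is through the combined loss. The sketch below assumes the weighting reported in the original paper, where each auxiliary loss is added to the main loss with a factor of 0.3 and the auxiliary branches are dropped at inference; the function names are made up for this illustration.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed in a stable way."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

def training_loss(main_logits, aux_logits_list, target, aux_weight=0.3):
    """Main loss plus down-weighted auxiliary losses (training only)."""
    loss = cross_entropy(main_logits, target)
    for aux_logits in aux_logits_list:
        loss += aux_weight * cross_entropy(aux_logits, target)
    return loss

# Toy logits for a 3-class problem (illustrative values only).
main = [2.0, 0.5, -1.0]
aux_4a = [1.0, 0.8, 0.1]
aux_4d = [1.5, 0.2, -0.3]
print(training_loss(main, [aux_4a, aux_4d], target=0))
```

The auxiliary terms inject gradient signal directly into the middle of the network, which is how they help against vanishing gradients.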

5. Model Architecture

GoogLeNet is a 22-layer deep network (excluding pooling layers) that emphasizes computational efficiency, making it feasible to run even on hardware with limited resources. The layer-by-layer architecture of GoogLeNet is outlined below.

Figure: Layer-by-layer architecture of GoogLeNet

The architecture also contains two auxiliary classifiers, connected to the outputs of the Inception (4a) and Inception (4d) modules.

Inception V1 architecture

Input Layer: Accepts a 224×224 RGB image
Initial Convolutions and Pooling: Apply convolution and max pooling layers to extract low-level features and reduce spatial dimensions
Local Response Normalization (LRN): Normalizes feature maps early to improve generalization
Inception Modules: Apply 1×1, 3×3, 5×5 convolutions and 3×3 max pooling in parallel, then concatenate outputs to capture multi-scale features
Auxiliary Classifiers: Intermediate branches with pooling, convolutions, fully connected layers, and softmax used to improve training stability
Final Layers: Use global average pooling followed by a fully connected layer and softmax for final classification
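The way spatial resolution shrinks through this pipeline can be traced with a short sketch. The kernel sizes and strides follow the original paper; the padding values and the ceiling rounding for max pooling are assumptions chosen to reproduce the reported sizes. The 1×1 convolution in the stem, LRN, and the Inception modules are omitted because they leave the spatial size unchanged.

```python
def out_size(n, kernel, stride, padding=0, ceil_mode=False):
    """Output length of a convolution or pooling window along one axis."""
    span = n + 2 * padding - kernel
    steps = -(-span // stride) if ceil_mode else span // stride
    return steps + 1

# (name, kernel, stride, padding, ceil_mode); padding and rounding are assumptions.
stages = [
    ("conv 7x7/2", 7, 2, 3, False),
    ("max pool 3x3/2", 3, 2, 0, True),
    ("conv 3x3/1", 3, 1, 1, False),
    ("max pool 3x3/2", 3, 2, 0, True),  # followed by Inception 3a-3b
    ("max pool 3x3/2", 3, 2, 0, True),  # followed by Inception 4a-4e
    ("max pool 3x3/2", 3, 2, 0, True),  # followed by Inception 5a-5b
    ("global avg pool 7x7", 7, 1, 0, False),
]

size = 224  # input is a 224x224 RGB image
sizes = []
for name, kernel, stride, padding, ceil_mode in stages:
    size = out_size(size, kernel, stride, padding, ceil_mode)
    sizes.append(size)
    print(f"{name:<20} -> {size}x{size}")
```

Under these assumptions the resolution goes from 224 down to 7 before global average pooling reduces each feature map to a single value.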

Performance and Results

Winner of ILSVRC 2014 in both classification and detection tasks
Achieved a top-5 error rate of 6.67% in image classification
An ensemble of six GoogLeNet models achieved 43.9% mAP (mean Average Precision) on the ImageNet detection task
