Understanding GoogLeNet Model - CNN Architecture

Last Updated : 12 May, 2026

GoogLeNet (Inception V1) is a convolutional neural network designed for efficient image classification. It uses the Inception module to process multiple filter sizes in parallel, improving feature extraction while keeping computation low.

  • Inception modules combine 1×1, 3×3, 5×5 convolutions and pooling in parallel
  • Uses 1×1 convolutions and global average pooling to reduce computation and parameters
  • Designed to achieve high accuracy with efficient use of resources

Key Features of GoogLeNet

1. 1×1 Convolutions

GoogLeNet uses 1×1 convolutions mainly for dimensionality reduction, which reduces computation and the number of trainable parameters while preserving important features.

Example comparison for a 5×5 convolution that maps a 14×14×480 input to 48 output channels:

  • Without 1×1 convolution: (14×14×48)×(5×5×480) = 112.9M operations
  • With 1×1 convolution (reducing to 16 channels first): (14×14×16)×(1×1×480) + (14×14×48)×(5×5×16) = 5.3M operations

[Figures: Without 1×1 Convolution; With 1×1 Convolution]

This is roughly a 21× reduction in computation (112.9M vs. 5.3M operations), achieved with little loss in performance.
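The arithmetic in the comparison above can be checked with a few lines of Python. This is a minimal sketch: the helper name `conv_multiplies` is hypothetical, and it counts only multiplications, ignoring biases and additions.

```python
def conv_multiplies(out_h, out_w, out_channels, kernel, in_channels):
    """Multiplications in a convolution: every output value is one
    dot product over a kernel x kernel x in_channels window."""
    return (out_h * out_w * out_channels) * (kernel * kernel * in_channels)


# Direct 5x5 convolution: 14x14x480 input -> 14x14x48 output.
direct = conv_multiplies(14, 14, 48, kernel=5, in_channels=480)

# Bottleneck: a 1x1 convolution down to 16 channels, then the 5x5 convolution.
reduce_step = conv_multiplies(14, 14, 16, kernel=1, in_channels=480)
conv_step = conv_multiplies(14, 14, 48, kernel=5, in_channels=16)
bottleneck = reduce_step + conv_step

print(f"direct:     {direct / 1e6:.1f}M")
print(f"bottleneck: {bottleneck / 1e6:.1f}M")
print(f"reduction:  {direct / bottleneck:.1f}x")
```

Most of the saving comes from the 5×5 convolution seeing 16 input channels instead of 480; the extra 1×1 step costs only about 1.5M operations.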

2. Global Average Pooling

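Instead of a stack of large fully connected layers at the end of the network, GoogLeNet uses global average pooling, which averages each feature map into a single value. Only one fully connected layer remains before the softmax.

  • Eliminates a large number of parameters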
  • Reduces overfitting
  • Improves generalization and accuracy
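As an illustration of the idea, the sketch below averages toy feature maps and compares classifier weight counts. The function name `global_average_pool` and the toy values are made up for this example, and the 7×7×1024 feature-map size and 1000 classes are assumptions about the final stage rather than figures stated in this article.

```python
def global_average_pool(feature_maps):
    """Average each H x W feature map (a list of rows) down to one number."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two toy 2x2 feature maps (illustrative values only).
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pool(maps)  # one value per channel

# Classifier weights, ASSUMING a 7x7x1024 final feature map and 1000 classes.
h, w, channels, classes = 7, 7, 1024, 1000
fc_on_flattened = h * w * channels * classes  # dense layer on the flattened map
fc_after_gap = channels * classes             # dense layer on the pooled vector

print(pooled)
print(f"flattened: {fc_on_flattened:,} weights, after pooling: {fc_after_gap:,} weights")
```

Under these assumptions the classifier needs 49 times fewer weights, because pooling collapses the 7×7 spatial grid before the dense layer is applied.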

3. Inception Module

The Inception module is the core building block of GoogLeNet. It applies multiple operations in parallel:

  • 1×1 convolutions
  • 3×3 convolutions
  • 5×5 convolutions
  • 3×3 max pooling

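All outputs are concatenated along the channel dimension to capture multi-scale features. 1×1 convolutions placed before the 3×3 and 5×5 convolutions, and after the pooling branch, keep the computation from growing significantly.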

[Figure: Inception Module]
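The channel bookkeeping of one module can be sketched as follows. The `InceptionSpec` class is hypothetical, and the branch widths are the Inception (4a) values as listed in the original GoogLeNet paper's architecture table, not numbers given in this article. The count covers multiplications only.

```python
from dataclasses import dataclass


@dataclass
class InceptionSpec:
    in_channels: int
    b1x1: int         # 1x1 branch
    b3x3_reduce: int  # 1x1 reduction before the 3x3 convolution
    b3x3: int
    b5x5_reduce: int  # 1x1 reduction before the 5x5 convolution
    b5x5: int
    pool_proj: int    # 1x1 projection after the 3x3 max pooling

    def out_channels(self):
        # Every branch keeps the spatial size, so concatenation sums channels.
        return self.b1x1 + self.b3x3 + self.b5x5 + self.pool_proj

    def multiplies(self, h, w):
        per_pixel = (
            self.in_channels * self.b1x1
            + self.in_channels * self.b3x3_reduce
            + 3 * 3 * self.b3x3_reduce * self.b3x3
            + self.in_channels * self.b5x5_reduce
            + 5 * 5 * self.b5x5_reduce * self.b5x5
            + self.in_channels * self.pool_proj
        )
        return h * w * per_pixel


# Inception (4a): 14x14x480 input (values assumed from the original paper).
inception_4a = InceptionSpec(480, 192, 96, 208, 16, 48, 64)
print(inception_4a.out_channels())
print(f"{inception_4a.multiplies(14, 14) / 1e6:.1f}M multiplications")
```

The 5×5 branch here is the same 480 → 16 → 48 path used in the 1×1 convolution example earlier in the article.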

4. Auxiliary Classifiers

To reduce vanishing gradient problems, GoogLeNet uses auxiliary classifiers during training.

Each classifier includes:

  • Average pooling
  • 1×1 convolution
  • Fully connected layers
  • Softmax output

These help stabilize training and improve generalization. They are used only during training and are discarded at inference time.
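One way to picture how the auxiliary outputs enter training is as extra, down-weighted loss terms added to the main loss. The sketch below assumes a weight of 0.3 for the auxiliary losses, which is the value reported in the original paper and not stated in this article. The function names and logits are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (max subtracted for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def training_loss(main_logits, aux_logits_list, target, aux_weight=0.3):
    """Main loss plus down-weighted auxiliary losses (aux_weight is an assumption)."""
    main = cross_entropy(main_logits, target)
    aux = sum(cross_entropy(a, target) for a in aux_logits_list)
    return main + aux_weight * aux


# Toy 3-class logits from the main head and two auxiliary heads.
main_logits = [2.0, 0.5, -1.0]
aux_logits = [[1.0, 1.0, 0.0], [0.5, 0.2, 0.1]]
target = 0

loss = training_loss(main_logits, aux_logits, target)
print(f"training loss: {loss:.4f}")
print(f"inference uses only the main head: {cross_entropy(main_logits, target):.4f}")
```

Because the auxiliary terms are attached to intermediate layers, their gradients reach the early layers through a shorter path than the gradient of the main loss.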

5. Model Architecture

GoogLeNet is a 22-layer deep network (counting only layers with parameters, not pooling layers) that emphasizes computational efficiency, making it feasible to run even on hardware with limited resources. The layer-by-layer architecture of GoogLeNet is detailed below.

[Figure: Layer-by-Layer Inception]

The architecture also contains two auxiliary classifiers, connected to the outputs of the Inception (4a) and Inception (4d) modules.

Inception V1 architecture

  • Input Layer: Accepts a 224×224 RGB image
  • Initial Convolutions and Pooling: Applies convolution and max pooling layers to extract low-level features and reduce spatial dimensions
  • Local Response Normalization (LRN): Normalizes feature maps early to improve generalization
  • Inception Modules: Apply 1×1, 3×3, 5×5 convolutions and 3×3 max pooling in parallel, then concatenate outputs to capture multi-scale features
  • Auxiliary Classifiers: Intermediate branches with pooling, convolutions, fully connected layers, and softmax used to improve training stability
  • Final Layers: Uses global average pooling followed by a fully connected layer and softmax for final classification
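To see how the spatial resolution shrinks from the 224×224 input to a single value per channel, the sketch below walks through the downsampling stages. The kernel sizes, strides, padding, and the rounding-up behaviour of the pooling layers are assumptions taken from the original paper and common reimplementations, not from this article. Inception modules are listed only as markers, since they keep the spatial size unchanged.

```python
import math


def out_size(n, kernel, stride, padding=0, ceil_mode=False):
    """Output width/height of a convolution or pooling layer."""
    span = (n + 2 * padding - kernel) / stride
    steps = math.ceil(span) if ceil_mode else math.floor(span)
    return steps + 1


POOL = dict(kernel=3, stride=2, ceil_mode=True)  # assumed pooling settings
stages = [
    ("conv 7x7, stride 2", dict(kernel=7, stride=2, padding=3)),
    ("max pool 3x3, stride 2", POOL),
    ("conv 3x3, stride 1", dict(kernel=3, stride=1, padding=1)),
    ("max pool 3x3, stride 2", POOL),
    ("inception 3a-3b", None),  # spatial size unchanged
    ("max pool 3x3, stride 2", POOL),
    ("inception 4a-4e", None),
    ("max pool 3x3, stride 2", POOL),
    ("inception 5a-5b", None),
    ("global average pool 7x7", dict(kernel=7, stride=1)),
]

size = 224
sizes = []
for name, params in stages:
    if params is not None:
        size = out_size(size, **params)
    sizes.append(size)
    print(f"{name:<26} -> {size}x{size}")
```

Under these assumptions the Inception (4a) stage operates at 14×14, which matches the feature-map size used in the 1×1 convolution example.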

Performance and Results

  • Winner of ILSVRC 2014 in both classification and detection tasks
  • Achieved a top-5 error rate of 6.67% in image classification
  • An ensemble of six GoogLeNet models achieved 43.9% mAP (mean Average Precision) on the ImageNet detection task
[Figure: GoogLeNet Classification top-5 Error]
[Figure: GoogLeNet Detection Performance]
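For readers unfamiliar with the metric, top-5 error is the fraction of images whose true label is not among the five highest-scoring predictions. The following sketch shows the computation on made-up scores for a 6-class toy problem. The function name `top_k_error` and all values are illustrative, not ILSVRC data.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is outside the k best-scoring classes."""
    wrong = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Toy scores for 4 images over 6 classes (illustrative values only).
descending = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]
ascending = [0.03, 0.07, 0.15, 0.20, 0.25, 0.30]
scores = [descending, descending, ascending, ascending]
labels = [4, 5, 5, 1]

print("top-5 error:", top_k_error(scores, labels, k=5))
print("top-1 error:", top_k_error(scores, labels, k=1))
```

Only the second image counts as a top-5 error here, because its true class has the lowest score of all six.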