Mask R-CNN

Last Updated : 26 Jun, 2026

Mask R-CNN is an advanced deep learning model for object detection and instance segmentation that extends Faster R-CNN by adding a parallel branch for pixel-level mask prediction. It not only detects objects and draws bounding boxes but also generates precise segmentation masks for each object.

  • Extends Faster R-CNN by adding a mask prediction branch for each Region of Interest (RoI).
  • Performs object detection and instance segmentation simultaneously with pixel-level accuracy.

Instance Segmentation

Instance segmentation is a computer vision task that identifies and separates each object in an image while assigning a unique pixel-level mask to every individual instance. It provides both object detection and precise object boundaries at the pixel level.

  • Separates each object instance individually within an image.
  • Assigns a pixel-level mask to each instance, marking its region with accurate boundaries.
Instance Segmentation
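
As a rough illustration of "one mask per instance", the toy sketch below stores two objects of the same class as separate binary masks and derives a bounding box and pixel area from each. The 4×6 grid, the "cat" label, and the helper names `mask_to_box` and `mask_area` are made up for this example and are not part of any Mask R-CNN library.

```python
# Toy 4x6 "image" containing two objects of the same class.
# Each instance gets its own binary mask (1 = pixel belongs to the instance).
mask_a = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
mask_b = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]


def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around a binary mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return (min(xs), min(ys), max(xs), max(ys))


def mask_area(mask):
    """Count the pixels that belong to the instance."""
    return sum(sum(row) for row in mask)


# Same class label, but two separate instances with separate masks.
instances = [{"label": "cat", "mask": mask_a}, {"label": "cat", "mask": mask_b}]
for index, instance in enumerate(instances):
    print(index, instance["label"], mask_to_box(instance["mask"]), mask_area(instance["mask"]))
```

A semantic segmentation output would merge both objects into a single "cat" region; keeping the masks separate is what makes this instance segmentation.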

Architecture

The architecture has four main parts: a backbone network for feature extraction, a Region Proposal Network (RPN), RoI Align, and parallel heads that predict the class, bounding box, and segmentation mask for each RoI.

Mask R-CNN Architecture

1. Backbone Network

The backbone network extracts feature maps from the input image using a deep CNN such as ResNet, either with features taken from the fourth stage (ResNet-C4) or combined with a Feature Pyramid Network (ResNet-FPN).

  • Uses deep convolutional networks for feature extraction
  • Feature Pyramid Network (FPN) improves multi-scale detection
  • With FPN, produces multi-scale feature maps P2, P3, P4, P5, and P6
Mask R-CNN backbone architecture
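
To get a feel for the pyramid levels, the sketch below estimates the spatial size of P2 through P6 for a given input. It assumes the common FPN convention in which P2 has stride 4 and each later level halves the resolution; the function name `fpn_feature_sizes` is hypothetical, and exact sizes in a real implementation depend on how the image is padded.

```python
import math

# Strides commonly associated with each FPN level (assumed convention:
# P2 is 1/4 of the input resolution, each further level halves it again).
FPN_STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}


def fpn_feature_sizes(height, width):
    """Approximate (height, width) of each pyramid level for an input image."""
    return {
        level: (math.ceil(height / stride), math.ceil(width / stride))
        for level, stride in FPN_STRIDES.items()
    }


sizes = fpn_feature_sizes(800, 1024)
for level, size in sizes.items():
    print(level, size)
```

Small objects are mostly handled on the high-resolution levels such as P2, while large objects map to the coarser levels, which is why the pyramid helps multi-scale detection.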

2. Region Proposal Network

The RPN generates candidate object regions from convolutional feature maps.

  • Applies a 3×3 convolution over the feature map, followed by 1×1 convolutions that produce the proposal outputs
  • Predicts objectness scores and bounding box coordinates
  • Uses anchor boxes of different scales and aspect ratios
  • Identifies potential object locations efficiently
Anchor Generation Mask R-CNN
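
Anchor generation can be sketched in a few lines. The example below is a simplified illustration, not the code of any particular library: it places anchors of several scales and aspect ratios at the center of every feature-map cell. The function names, the stride of 16, and the chosen scales and ratios are assumptions made for the example.

```python
import math


def make_anchors(center_x, center_y, scales, aspect_ratios):
    """Return anchors (x1, y1, x2, y2) centered on one location.

    Each anchor has an area of roughly scale**2; the aspect ratio is
    interpreted here as height / width.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ))
    return anchors


def grid_anchors(feat_h, feat_w, stride, scales, aspect_ratios):
    """Place the same set of anchors at the center of every feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            center_x = (col + 0.5) * stride
            center_y = (row + 0.5) * stride
            anchors.extend(make_anchors(center_x, center_y, scales, aspect_ratios))
    return anchors


anchors = grid_anchors(2, 3, stride=16, scales=[32, 64], aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 2 * 3 locations * 6 anchors each = 36
```

The RPN then scores each anchor for objectness and regresses offsets that turn the best anchors into region proposals.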

3. Mask Representation

The mask branch predicts segmentation masks for each Region of Interest (RoI).

  • Uses a Fully Convolutional Network (FCN) for pixel-level prediction
  • Preserves spatial structure of features
  • Generates one m×m mask per object class for each RoI; the mask of the predicted class is used as the output
  • Uses RoI Align for accurate mask generation
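
The sketch below illustrates how a per-class mask output might be turned into a binary mask for one RoI: pick the m×m map of the predicted class, apply a per-pixel sigmoid, and threshold it. The tiny 2×2 logits, the 0.5 threshold default, and the function name `select_class_mask` are assumptions for illustration; a real pipeline would also resize the mask to the RoI's size in the image.

```python
import math


def sigmoid(value):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def select_class_mask(mask_logits, class_id, threshold=0.5):
    """Pick the m x m mask of the predicted class and binarize it.

    mask_logits is a nested list of raw scores with shape [num_classes][m][m].
    """
    logits = mask_logits[class_id]
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


# Hypothetical mask-branch output for one RoI: 2 classes, m = 2.
mask_logits = [
    [[-3.0, -2.0], [-1.0, -4.0]],  # class 0
    [[2.0, -1.0], [0.5, 3.0]],     # class 1
]
binary_mask = select_class_mask(mask_logits, class_id=1)
print(binary_mask)  # [[1, 0], [1, 1]]
```

Because each class has its own map, the mask branch does not need to decide the class itself; that decision comes from the classification branch.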

4. RoI Align

RoI Align extracts fixed-size feature maps from region proposals while preserving exact spatial alignment. It improves on RoI Pooling by removing the quantization (rounding) of RoI coordinates, so features stay aligned with the input pixels, which is important for mask prediction.

ROI Align
  • Takes the feature map from the previous convolution layer and divides each RoI on it into an M × N grid of bins, without rounding the RoI boundaries or bin edges to integer coordinates.
  • Uses bilinear interpolation to compute exact feature values at sampled locations.
  • Produces fixed-size feature maps for each Region of Interest, improving segmentation accuracy.
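
The steps above can be sketched on a single-channel feature map. This is a minimal illustration under stated assumptions, not a reference implementation: values are treated as sitting at integer coordinates, each bin is sampled at a regular `samples` × `samples` grid of points, and the samples are averaged. Real libraries differ in coordinate conventions and in the number of sampling points; the names `bilinear_sample` and `roi_align` are chosen for this example.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at fractional (y, x)."""
    height, width = len(feature), len(feature[0])
    # Clamp so that samples on the border stay inside the map.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_h, out_w, samples=2):
    """Pool an RoI (x1, y1, x2, y2) into an out_h x out_w grid.

    The RoI stays in floating-point feature-map coordinates; nothing is rounded.
    """
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / (samples * samples))
        output.append(row)
    return output


# A 4x4 map whose value equals its column index, so results are easy to check.
feature = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(feature, (0.5, 0.5, 2.5, 2.5), out_h=2, out_w=2)
print(pooled)
```

RoI Pooling would first round the RoI to (0, 0, 2, 2) or similar, shifting the sampled region by up to half a cell; here the fractional coordinates are used directly.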

Working

Mask R-CNN processes an image in a single unified pipeline that produces object detection and instance segmentation outputs together.

  • Uses a Region Proposal Network (RPN) to generate candidate object regions.
  • Extracts region features using RoI Align for precise spatial alignment.
  • Performs object classification using a softmax classifier to assign class labels.
  • Applies bounding box regression to refine object localization.
  • Generates pixel-level segmentation masks through a dedicated mask branch and outputs final predictions.
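
The classification and box-refinement steps can be illustrated with a small sketch. It assumes the center/size offset parameterization commonly used in the R-CNN family, where `(dx, dy)` shift the box center relative to its size and `(dw, dh)` scale width and height in log space. Some implementations additionally divide the offsets by fixed weights, which is omitted here; the function names and sample numbers are made up for the example.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_box_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with predicted offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    width, height = x2 - x1, y2 - y1
    center_x, center_y = x1 + 0.5 * width, y1 + 0.5 * height
    # Shift the center proportionally to the box size, rescale in log space.
    new_cx = center_x + dx * width
    new_cy = center_y + dy * height
    new_w = width * math.exp(dw)
    new_h = height * math.exp(dh)
    return (
        new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
        new_cx + 0.5 * new_w, new_cy + 0.5 * new_h,
    )


# Hypothetical head outputs for one RoI with three classes.
class_scores = [0.2, 2.5, -1.0]
probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))
refined_box = apply_box_deltas((0.0, 0.0, 10.0, 20.0), (0.1, 0.0, 0.0, 0.0))
print(predicted_class, refined_box)
```

The mask for the same RoI is predicted in parallel by the mask branch, and the predicted class decides which per-class mask is kept.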

Applications

  • Medical Imaging: Used for tumor detection, organ segmentation, and anomaly identification in scans like MRI and CT.
  • Autonomous Vehicles: Helps detect and segment pedestrians, vehicles, and road objects for safe driving.
  • Surveillance Systems: Supports object tracking and activity monitoring in security footage.
  • Image Editing & AR: Enables object removal, background editing, and augmented reality effects.
  • Aerial Imaging: Used in drones and satellite images for mapping and object detection.

Advantages

  • Generates proposals with an RPN that shares backbone features, which costs less than exhaustive sliding-window search
  • Flexible architecture that supports different backbone networks
  • Achieved state-of-the-art results in instance segmentation when it was introduced and remains a widely used baseline

Limitations

  • Requires high computing resources such as GPUs
  • Needs detailed pixel-level annotated datasets for training
  • Training and inference can be slower compared to simpler detection models
  • Less suitable for real-time applications with strict latency requirements