Mask R-CNN is a deep learning model for object detection and instance segmentation. It extends Faster R-CNN with a branch that predicts a pixel-level mask for each detected object, in parallel with the existing classification and bounding-box branches, so each detection comes with both a box and a segmentation mask.
- Extends Faster R-CNN by adding a mask prediction branch for each Region of Interest (RoI).
- Performs object detection and instance segmentation in a single model, producing one pixel-level mask per detected object.
Instance Segmentation
Instance segmentation is a computer vision task that identifies each object in an image and assigns a separate pixel-level mask to every individual instance. Unlike semantic segmentation, which only labels each pixel with a class, it also distinguishes between different objects of the same class.
- Separates each object instance individually within an image.
- Assigns every instance its own pixel-level mask, which outlines the object more precisely than a bounding box.
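To make the idea of per-instance masks concrete, here is a minimal sketch in plain Python. The `instance_map` grid and the `split_instances` helper are hypothetical illustrations, not part of Mask R-CNN itself; they only show what "one binary mask per instance" means for two objects of the same class.

```python
def split_instances(instance_map):
    """Return {instance_id: binary mask} for every non-zero ID in the map."""
    ids = sorted({v for row in instance_map for v in row if v != 0})
    return {
        i: [[1 if v == i else 0 for v in row] for row in instance_map]
        for i in ids
    }


# Hypothetical 4x6 image: 0 = background, 1 and 2 = two objects of the same class.
instance_map = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

masks = split_instances(instance_map)
for instance_id, mask in masks.items():
    area = sum(sum(row) for row in mask)
    print(f"instance {instance_id}: {area} pixels")
```

A semantic segmentation of the same image would merge both objects into a single class mask; instance segmentation keeps them apart.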

Architecture
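Mask R-CNN keeps the two-stage design of Faster R-CNN (a backbone and a Region Proposal Network, followed by per-region prediction heads) and adds a mask branch that runs in parallel with classification and bounding box regression.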
1. Backbone Network
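The backbone network extracts feature maps from the input image using a deep CNN. Common choices are ResNet-C4, which takes features from the fourth ResNet stage, and ResNet-FPN, which adds a Feature Pyramid Network on top of ResNet.
- Uses deep convolutional networks for feature extraction
- Feature Pyramid Network (FPN) improves detection of objects at different scales
- With FPN, produces a pyramid of feature maps (P2, P3, P4, P5, and P6) at progressively lower resolutions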
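With several pyramid levels available, each RoI has to be matched to one of them. The sketch below assumes the assignment heuristic described in the FPN paper, where an RoI of width w and height h goes to level k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4. The function name `fpn_level` and its default arguments are illustrative, and the clamp to P2 through P5 is an assumption based on P6 being used for proposals only.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level for an RoI of size w x h (in input pixels).

    Larger RoIs map to coarser levels, smaller RoIs to finer ones.
    The result is clamped to [k_min, k_max].
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 448, 1000):
    print(f"{size}x{size} RoI -> P{fpn_level(size, size)}")
```

The effect is that small objects are read from high-resolution maps and large objects from low-resolution ones, so every RoI covers a comparable number of feature cells.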
2. Region Proposal Network
The RPN generates candidate object regions from convolutional feature maps.
- Slides a 3×3 convolution over the feature map, followed by two sibling 1×1 convolutions
- Predicts an objectness score and bounding box offsets for each anchor
- Uses anchor boxes of different scales and aspect ratios
- Identifies potential object locations efficiently
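The anchor boxes mentioned above can be sketched as follows. This is an illustration only: the scales (128, 256, 512) and aspect ratios (0.5, 1, 2) are assumed defaults taken from the Faster R-CNN paper, and `make_anchors` is a hypothetical helper, not a library function.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    Each anchor has area scale**2 and aspect ratio height / width = ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Anchors for one feature-map location that corresponds to image point (100, 100).
anchors = make_anchors(100, 100)
print(len(anchors), "anchors at this location")
```

The RPN scores every such anchor at every feature-map location and regresses offsets relative to it, which is how a fixed set of box shapes can cover objects of many sizes.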
3. Mask Representation
The mask branch predicts segmentation masks for each Region of Interest (RoI).
- Uses a small Fully Convolutional Network (FCN) for pixel-level prediction
- Preserves the spatial layout of features instead of flattening them into a vector
- Generates one m×m mask per object class for each RoI; the mask for the predicted class is kept
- Uses RoI Align for accurate mask generation
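As a rough sketch of how the per-class masks are used at inference, the snippet below assumes the mask head returns K grids of raw scores (logits) per RoI, applies a per-pixel sigmoid to the grid of the predicted class, and thresholds at 0.5, as described in the Mask R-CNN paper. The helper `select_mask` and the tiny 2×2 example values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_mask(mask_logits, class_id, threshold=0.5):
    """Pick the m x m grid of the predicted class and binarise it.

    mask_logits: list of K grids (one per class), each a list of m rows.
    class_id:    index chosen by the classification branch.
    """
    logits = mask_logits[class_id]
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


# Hypothetical output for K = 2 classes and m = 2.
mask_logits = [
    [[-3.0, -3.0], [-3.0, -3.0]],  # class 0
    [[2.0, -2.0], [0.0, -0.1]],    # class 1
]
binary_mask = select_mask(mask_logits, class_id=1)
print(binary_mask)
```

Because each class has its own grid and the sigmoid is applied per pixel, classes do not compete inside the mask branch; the class decision is left to the classification branch.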
4. RoI Align
RoI Align extracts a fixed-size feature map from each region proposal while preserving exact spatial alignment. It improves on RoI Pooling by removing quantization, so features stay aligned with the input pixels, which is important for mask prediction.
- Takes the feature map from the backbone and divides each RoI into an M × N grid of bins without rounding the RoI or bin boundaries to integers.
- Uses bilinear interpolation to compute feature values at sampled locations inside each bin, then aggregates them (max or average).
- Produces fixed-size feature maps for each Region of Interest, improving segmentation accuracy.
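The procedure above can be sketched in plain Python for a single-channel feature map. This is a simplified illustration, not a reference implementation: it assumes the value `feat[y][x]` sits at coordinate (y, x), takes a regular grid of `samples × samples` points per bin, and averages them. The names `bilinear` and `roi_align` are local to this example.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D grid (list of rows) at real-valued (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, roi, out_h, out_w, samples=2):
    """Pool an RoI into an out_h x out_w grid without rounding any coordinate.

    roi = (x1, y1, x2, y2) in feature-map coordinates, kept as floats.
    """
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feat, y, x)
            row.append(total / (samples * samples))
        out.append(row)
    return out


# Feature map whose value equals its column index, so results are easy to check.
feat = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
pooled = roi_align(feat, roi=(0.5, 0.5, 2.5, 2.5), out_h=2, out_w=2)
print(pooled)
```

RoI Pooling would first round the RoI corners (0.5 and 2.5) to integers, shifting the region by up to half a feature cell; here the fractional corners are used directly.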
Working
Mask R-CNN performs object detection and instance segmentation in a single unified pipeline. For each input image, the model:
- Uses a Region Proposal Network (RPN) to generate candidate object regions.
- Extracts region features using RoI Align for precise spatial alignment.
- Performs object classification using a softmax classifier to assign class labels.
- Applies bounding box regression to refine object localization.
- Generates pixel-level segmentation masks through a dedicated mask branch and outputs final predictions.
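The last step, turning a small m×m mask into a full-image mask, can be sketched as below. In the Mask R-CNN paper the predicted mask is resized to the RoI size and binarised at 0.5; this example swaps in nearest-neighbour resizing to stay short, which is a simplification. The helper `paste_mask` and the sample values are hypothetical.

```python
def paste_mask(mask_probs, box, img_h, img_w, threshold=0.5):
    """Resize an m x m probability mask to `box` and binarise it on a full canvas.

    box = (x1, y1, x2, y2) in integer pixel coordinates, x2 and y2 exclusive.
    """
    m = len(mask_probs)
    x1, y1, x2, y2 = box
    bw, bh = x2 - x1, y2 - y1
    canvas = [[0] * img_w for _ in range(img_h)]
    for y in range(max(y1, 0), min(y2, img_h)):
        for x in range(max(x1, 0), min(x2, img_w)):
            # Nearest-neighbour lookup into the m x m mask (simplification).
            my = min(int((y - y1 + 0.5) * m / bh), m - 1)
            mx = min(int((x - x1 + 0.5) * m / bw), m - 1)
            canvas[y][x] = 1 if mask_probs[my][mx] >= threshold else 0
    return canvas


# Hypothetical 2x2 mask for a detection whose box covers pixels 1..4 in a 6x6 image.
mask_probs = [[0.9, 0.1], [0.2, 0.8]]
full_mask = paste_mask(mask_probs, box=(1, 1, 5, 5), img_h=6, img_w=6)
for row in full_mask:
    print(row)
```

Each detection therefore ends up as a class label, a confidence score, a refined box, and a binary mask of the same size as the input image.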
Applications
- Medical Imaging: Used for tumor detection, organ segmentation, and anomaly identification in scans like MRI and CT.
- Autonomous Vehicles: Helps detect and segment pedestrians, vehicles, and road objects for safe driving.
- Surveillance Systems: Supports object tracking and activity monitoring in security footage.
- Image Editing & AR: Enables object removal, background editing, and augmented reality effects.
- Aerial Imaging: Used in drones and satellite images for mapping and object detection.
Advantages
- Generates proposals with a learned RPN that shares backbone features, which costs less than exhaustive search over image regions
- Flexible architecture that supports different backbone networks
- Achieved state-of-the-art instance segmentation results on the COCO benchmark when it was published
Limitations
- Requires high computing resources such as GPUs
- Needs detailed pixel-level annotated datasets for training
- Training and inference can be slower than with simpler detection models
- Less suitable for real-time applications with strict latency requirements