Mask R-CNN

Mask R-CNN is a deep learning model for object detection and instance segmentation. It extends Faster R-CNN with a branch that predicts a pixel-level mask for each detected object, in parallel with the existing classification and bounding-box branches. For every object it therefore outputs a class label, a bounding box, and a segmentation mask.

Extends Faster R-CNN by adding a mask prediction branch for each Region of Interest (RoI).
Performs object detection and instance segmentation in one network, producing masks at pixel level.

Instance Segmentation

Instance segmentation is a computer vision task that identifies and separates each object in an image while assigning a unique pixel-level mask to every individual instance. It provides both object detection and precise object boundaries at the pixel level.

Separates each object instance individually within an image.
Assigns each instance its own pixel-level mask, so two objects of the same class remain distinguishable.
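
To make the idea of one mask per instance concrete, here is a minimal sketch in plain Python. It assumes a hypothetical encoding in which an image is a grid of integers, 0 is background, and every positive integer identifies one object instance. The function names are illustrative and do not come from any library.

```python
def instance_masks(label_map):
    """Split an instance-ID map into one binary mask per instance.

    label_map is a list of rows; 0 means background and each positive
    integer identifies one object instance (a hypothetical encoding).
    """
    ids = sorted({v for row in label_map for v in row if v != 0})
    return {i: [[1 if v == i else 0 for v in row] for row in label_map] for i in ids}


def mask_to_box(mask):
    """Tightest (x1, y1, x2, y2) box around the nonzero pixels of a mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs), max(ys))


# Two objects of the same class still receive different instance IDs.
label_map = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 2, 2],
]
masks = instance_masks(label_map)
boxes = {i: mask_to_box(m) for i, m in masks.items()}
```

Each instance ends up with its own binary mask and a bounding box derived from it, which is the pair of outputs an instance segmentation model is expected to produce.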

Architecture

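The architecture keeps the two-stage design of Faster R-CNN. A backbone computes feature maps, a Region Proposal Network proposes candidate regions, and per-region heads classify each region and refine its box. Mask R-CNN adds a third head that predicts a segmentation mask for each region.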

1. Backbone Network

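The backbone network extracts feature maps from the input image. Common choices are a ResNet whose features are taken from the final layer of its fourth stage (ResNet-C4) and a ResNet combined with a Feature Pyramid Network (ResNet-FPN).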

Uses deep convolutional networks for feature extraction
Feature Pyramid Network (FPN) improves multi-scale detection
With FPN, produces feature maps at several scales: P2, P3, P4, P5, and a coarser P6 that only the RPN uses
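
The pyramid levels differ mainly in resolution. The sketch below assumes the strides commonly used with FPN (4, 8, 16, 32, and 64 pixels for P2 to P6) and estimates the spatial size of each level for a given input. Exact sizes in a real implementation depend on its padding and resizing rules, so treat the numbers as approximate.

```python
import math

# Assumed strides of the FPN levels relative to the input image.
FPN_STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}


def fpn_feature_sizes(height, width):
    """Approximate (height, width) of each pyramid level for a given input size."""
    return {
        name: (math.ceil(height / s), math.ceil(width / s))
        for name, s in FPN_STRIDES.items()
    }


sizes = fpn_feature_sizes(800, 1024)
for name, (h, w) in sizes.items():
    print(f"{name}: {h} x {w}")
```

Fine levels such as P2 keep more spatial detail and suit small objects, while coarse levels such as P5 cover large objects with fewer cells.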

2. Region Proposal Network

The RPN generates candidate object regions from convolutional feature maps.

Slides a 3×3 convolution over the feature map, followed by two sibling 1×1 convolutions that output the proposals
Predicts objectness scores and bounding box coordinates
Uses anchor boxes of different scales and aspect ratios
Identifies potential object locations efficiently
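
Anchor generation can be sketched as follows. This is a simplified illustration, not the code of any particular framework: it places one anchor per aspect ratio at the centre of every feature-map cell, with a single scale per call. The parameter names and the convention that a ratio means height divided by width are assumptions made for the example.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scale, ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) in image coordinates.

    One anchor per aspect ratio is centred on every feature-map cell.
    A ratio is height / width; every anchor has area scale * scale.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this cell, mapped back to image coordinates.
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for r in ratios:
                w = scale / math.sqrt(r)
                h = scale * math.sqrt(r)
                anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# A 2 x 3 feature map with stride 16 and 64-pixel anchors gives 2 * 3 * 3 anchors.
anchors = generate_anchors(feat_h=2, feat_w=3, stride=16, scale=64)
```

For every anchor the RPN then predicts an objectness score and four box offsets. Anchors that extend past the image border, as some do here, are usually clipped or ignored.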

3. Mask Representation

The mask branch predicts segmentation masks for each Region of Interest (RoI).

Uses a Fully Convolutional Network (FCN) for pixel-level prediction
Preserves spatial structure of features
Generates, for every RoI, one m×m binary mask per object class and keeps the mask of the predicted class
Uses RoI Align for accurate mask generation
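
The selection of the class-specific mask can be illustrated with a small sketch. It assumes the mask branch returns, for one RoI, a list of K grids of m×m logits, and that a per-pixel sigmoid followed by a 0.5 threshold turns the chosen grid into a binary mask. The values and names are hypothetical. Resizing the mask to the size of the detected box, which a full implementation performs, is omitted here.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def select_mask(mask_logits, class_id, threshold=0.5):
    """Pick the m x m mask of the predicted class and binarise it.

    mask_logits[k] holds the m x m logits predicted for class k.
    """
    logits = mask_logits[class_id]
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


# Hypothetical output for one RoI with K = 2 classes and m = 2.
mask_logits = [
    [[-3.0, -2.0], [-1.0, -4.0]],  # class 0
    [[2.5, -0.7], [0.3, 1.8]],     # class 1
]
binary_mask = select_mask(mask_logits, class_id=1)
```

Because each class has its own mask, the classes do not compete at the pixel level. The classification branch decides the label and the mask branch only has to decide which pixels belong to the object.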

4. RoI Align

RoI Align is used to extract fixed-size feature maps from region proposals while preserving exact spatial alignment. It improves RoI Pooling by removing quantization and ensuring pixel-accurate feature mapping, which is important for mask prediction.

Divides each RoI on the feature map into an M × N grid of bins, keeping the RoI and bin boundaries as floating-point values without rounding.
Uses bilinear interpolation to compute feature values at regularly spaced sample points inside each bin, then aggregates them by max or average pooling.
Produces fixed-size feature maps for each Region of Interest, improving segmentation accuracy.
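
The procedure above can be sketched in plain Python for a single-channel feature map. This is a simplified illustration under stated assumptions: feature values are taken to sit at integer coordinates, each bin uses a 2×2 grid of sample points, and the samples are averaged. Library implementations differ in coordinate conventions and options.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_h, out_w, samples=2):
    """Average samples x samples bilinear samples in each of out_h x out_w bins.

    roi is (x1, y1, x2, y2) in feature-map coordinates and is never rounded.
    """
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Sample points are spread evenly inside the bin.
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feature, y, x)
            row.append(total / (samples * samples))
        output.append(row)
    return output


# On the linear ramp f(y, x) = x, each output equals the x-centre of its bin.
feature = [[float(x) for x in range(8)] for _ in range(8)]
pooled = roi_align(feature, roi=(1.5, 1.0, 5.5, 5.0), out_h=2, out_w=2)
```

The RoI starts at x = 1.5, a position that RoI Pooling would have rounded to a whole cell. Here the fractional offset is preserved, so the pooled values track the true position of the region.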

Working

Mask R-CNN performs object detection and instance segmentation in a single pipeline. After the backbone has computed feature maps, processing proceeds as follows.

Uses a Region Proposal Network (RPN) to generate candidate object regions.
Extracts region features using RoI Align for precise spatial alignment.
Performs object classification using a softmax classifier to assign class labels.
Applies bounding box regression to refine object localization.
Generates pixel-level segmentation masks through a dedicated mask branch and outputs final predictions.
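
The classification and box refinement steps can be sketched for a single proposal. The example assumes the box parameterization commonly used in the R-CNN family, in which the offsets (dx, dy, dw, dh) shift the box centre in proportion to its size and scale its width and height exponentially. The logits and offsets below are hypothetical values, not outputs of a trained model.

```python
import math


def softmax(logits):
    """Convert class logits to probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_box_deltas(box, deltas):
    """Refine an (x1, y1, x2, y2) proposal with (dx, dy, dw, dh) offsets."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    # Shift the centre in proportion to the box size; rescale width and height.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)


# Hypothetical head outputs for one proposal.
probs = softmax([0.2, 2.9, -1.0])  # background, class 1, class 2
label = probs.index(max(probs))
refined = apply_box_deltas((10.0, 20.0, 50.0, 60.0), (0.1, 0.0, 0.0, math.log(0.5)))
```

In this example the proposal is assigned class 1, moved 4 pixels to the right, and halved in height. The mask branch would then predict the mask for the refined region.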

Applications

Medical Imaging: Used for tumor detection, organ segmentation, and anomaly identification in scans like MRI and CT.
Autonomous Vehicles: Helps detect and segment pedestrians, vehicles, and road objects for safe driving.
Surveillance Systems: Supports object tracking and activity monitoring in security footage.
Image Editing & AR: Enables object removal, background editing, and augmented reality effects.
Aerial Imaging: Used in drones and satellite images for mapping and object detection.

Advantages

Generates proposals with a learned RPN that shares backbone features, which costs less than exhaustive search over all image locations
Flexible architecture that supports different backbone networks
Achieved state-of-the-art instance segmentation results on the COCO benchmark when it was published in 2017

Limitations

Requires substantial computing resources, typically GPUs, for both training and inference
Needs detailed pixel-level annotated datasets for training
Training and inference can be slower compared to simpler detection models
Less suitable for real-time applications with strict latency requirements; the original paper reports about 5 frames per second