Computer Vision Interview Questions

Computer vision is a field of artificial intelligence that enables machines to interpret and understand visual information from the world. It encompasses a wide range of tasks such as image classification, object detection, image segmentation and image generation. As the demand for advanced computer vision applications grows, so does the need for skilled professionals who can develop and implement these technologies effectively.

1. Explain the concept of pixels and image resolution.

A pixel (short for picture element) is the smallest unit of a digital image. Each pixel holds a value representing color or intensity and when combined with millions of other pixels, it forms a complete image. Image resolution, on the other hand, refers to the amount of detail an image holds. It is usually expressed as the number of pixels along the width and height of an image (e.g., 1920×1080) or as pixel density (pixels per inch, PPI). Higher resolution means more pixels which generally results in clearer and sharper images.

Each pixel may represent grayscale intensity (0–255) or color values (RGB).
Resolution can be measured in: Spatial resolution (width × height in pixels) and Pixel density (PPI or DPI for print).
High resolution provides more detail but requires more storage and processing.
Low resolution results in pixelation when images are enlarged.
Common resolutions: 720p (HD, 1280×720), 1080p (Full HD, 1920×1080), 4K (Ultra HD, 3840×2160).
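
As a rough illustration of the storage point above, the sketch below builds an empty Full HD frame as a NumPy array and reports its pixel count and uncompressed size. The array names are illustrative only.

```python
import numpy as np

# A grayscale Full HD frame: one 8-bit intensity (0-255) per pixel.
gray = np.zeros((1080, 1920), dtype=np.uint8)

# The same frame in RGB: three 8-bit values per pixel.
rgb = np.zeros((1080, 1920, 3), dtype=np.uint8)

height, width = gray.shape
print(f"resolution: {width}x{height}, pixels: {gray.size}")
print(f"grayscale bytes: {gray.nbytes}, RGB bytes: {rgb.nbytes}")
```

Note that NumPy stores images as (rows, columns), so the shape is height first, while resolutions are usually quoted as width × height.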

2. Explain the 2D Discrete Fourier Transform (DFT).

The 2D Discrete Fourier Transform (DFT) is a way to convert a 2D image from its spatial form (pixels) into the frequency domain. In the frequency domain, we can see which patterns or details (like edges or textures) are present in the image. Each value in the DFT is a complex number: its magnitude gives the strength (amplitude) of a specific frequency component and its angle gives the phase, i.e., how that component is shifted in space. This is very useful for tasks like filtering, compression and detecting patterns.

The 2D DFT of an image f(x,y) of size M \times N is:

F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y) \cdot e^{-j 2 \pi \left(\frac{ux}{M} + \frac{vy}{N}\right)}

Converts the image from the spatial domain to the frequency domain.
Each F(u,v) represents a specific frequency’s magnitude and phase.
Useful for low-pass or high-pass filtering, edge detection and image compression.
Direct computation of DFT is slow for large images.
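
A minimal sketch of the formula above, assuming NumPy is available. The function name `dft2_direct` is hypothetical. It evaluates the double sum as two matrix products (the exponential separates into an x part and a y part) and is meant only for checking small inputs.

```python
import numpy as np

def dft2_direct(f):
    """Evaluate the 2D DFT sum without an FFT (slow; for illustration only)."""
    M, N = f.shape
    x = np.arange(M)
    y = np.arange(N)
    # W_M[u, x] = exp(-j*2*pi*u*x/M), and likewise W_N[v, y].
    W_M = np.exp(-2j * np.pi * np.outer(x, x) / M)
    W_N = np.exp(-2j * np.pi * np.outer(y, y) / N)
    # F(u, v) = sum_x sum_y W_M[u, x] * f(x, y) * W_N[v, y]
    return W_M @ f @ W_N.T

rng = np.random.default_rng(0)
image = rng.random((8, 6))
F = dft2_direct(image)

magnitude = np.abs(F)      # strength of each frequency component
phase = np.angle(F)        # phase of each frequency component
```

The result should agree with `np.fft.fft2(image)`, and `F[0, 0]` is simply the sum of all pixel values (the zero-frequency term).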

3. How does the Fast Fourier Transform (FFT) improve over DFT?

The Fast Fourier Transform (FFT) is a faster algorithm for computing the DFT. Evaluating the 2D DFT sum directly for an N \times N image takes O(N^4) operations, because each of the N^2 outputs sums over N^2 pixels. FFT reduces this to O(N^2 \log N), which is much faster. It does this by recursively splitting the problem into smaller DFTs and reusing their results. This makes FFT very practical for real-time image and signal processing.

FFT makes DFT computation much faster, especially for large images.
Most software libraries (like NumPy or MATLAB) use FFT under the hood.
Allows real-time processing for audio, video and images.
Produces the same result as the direct DFT (up to floating-point rounding) while improving efficiency.
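
The divide-and-reuse idea can be sketched with a recursive radix-2 FFT in one dimension. This is a teaching sketch, not how production libraries implement it; `fft_radix2` is a hypothetical name and the input length is assumed to be a power of two.

```python
import numpy as np

def fft_radix2(signal):
    """Recursive radix-2 FFT; the length must be a power of two."""
    signal = np.asarray(signal, dtype=complex)
    n = signal.shape[0]
    if n == 1:
        return signal
    # Solve two half-size problems (even and odd samples) ...
    even = fft_radix2(signal[0::2])
    odd = fft_radix2(signal[1::2])
    # ... and reuse both results for the first and second half of the output.
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

samples = np.arange(8, dtype=float)
spectrum = fft_radix2(samples)
```

For a 2D image, the same 1D transform can be applied to every row and then to every column, which is where the N^2 log N cost comes from.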

4. What is convolution in image processing and why is it important?

Convolution is a fundamental operation in image processing where a small matrix, called a kernel or filter, is applied over an image to extract certain features or modify the image. It works by sliding the kernel across the image and performing element-wise multiplication followed by summation to produce a new value for each pixel. Convolution is important because it allows us to perform essential tasks such as blurring, sharpening, edge detection and feature extraction in a systematic and efficient way. Most image processing and computer vision techniques rely on convolution for analyzing patterns in images.

A kernel (or filter) is a small matrix like 3×3 or 5×5 that defines the operation (e.g., smoothing, detecting edges).
The convolution operation at pixel (x,y) is:

g(x,y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} f(x-i, y-j) \cdot h(i,j)

Where f(x,y) is the input image, h(i,j) is the kernel and g(x,y) is the output image.

Convolution can be used for:

Smoothing/Blurring: Reduces noise and softens images.
Sharpening: Highlights edges and fine details.
Edge Detection: Finds boundaries in images (e.g., Sobel, Prewitt filters).
Feature Extraction: Helps in object recognition and deep learning.
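
The convolution formula above can be sketched directly in NumPy. This is a slow reference version under stated assumptions: a square, odd-sized kernel, zero padding at the borders, and the hypothetical function name `convolve2d`.

```python
import numpy as np

def convolve2d(image, kernel):
    """2D convolution with zero padding; output has the same size as the input."""
    k = kernel.shape[0] // 2            # assumes a square, odd-sized kernel
    flipped = kernel[::-1, ::-1]        # convolution flips the kernel
    padded = np.pad(image.astype(float), k)
    out = np.zeros(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + 2 * k + 1, c:c + 2 * k + 1]
            out[r, c] = np.sum(window * flipped)
    return out

# Convolving a single bright pixel (an impulse) reproduces the kernel itself.
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
kernel = np.arange(9, dtype=float).reshape(3, 3)
response = convolve2d(impulse, kernel)

# A 3x3 box kernel averages each neighborhood (blurring).
box = np.ones((3, 3)) / 9.0
blurred = convolve2d(impulse, box)
```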

5. What is correlation and how does it differ from convolution?

Correlation is a technique used to measure the similarity between two signals or images. In image processing, it involves sliding a small template or kernel over an image and computing a similarity measure at each position. High correlation values indicate that the pattern in the kernel closely matches the region in the image. Correlation is commonly used in template matching, pattern recognition and feature detection.

Feature	Convolution	Correlation
Definition	Combines an image and a kernel by flipping the kernel and summing products	Measures similarity between an image and a kernel without flipping
Kernel Orientation	Kernel is rotated 180° before applying	Kernel is used as-is
Application	Used for filtering, edge detection, blurring, feature extraction	Used for template matching, pattern recognition, detecting similarities
Effect on Image	Can produce results like blurring or sharpening	Produces similarity map indicating where the template matches best

Both operations are linear and widely used in image processing.
Convolution and correlation give identical results when the kernel is unchanged by a 180° rotation (e.g., box or Gaussian kernels); for other kernels, such as Sobel, the results differ.
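
The kernel-flip difference can be checked numerically. The sketch below assumes zero padding and square, odd-sized kernels. The function names are illustrative.

```python
import numpy as np

def correlate2d(image, kernel):
    """Cross-correlation with zero padding: the kernel is applied as-is."""
    k = kernel.shape[0] // 2
    padded = np.pad(image.astype(float), k)
    out = np.zeros(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + 2 * k + 1, c:c + 2 * k + 1] * kernel)
    return out

def convolve2d(image, kernel):
    """Convolution is correlation with the kernel rotated by 180 degrees."""
    return correlate2d(image, kernel[::-1, ::-1])

rng = np.random.default_rng(1)
image = rng.random((6, 6))
box = np.ones((3, 3)) / 9.0                                               # symmetric
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)     # not symmetric

same_for_box = np.allclose(correlate2d(image, box), convolve2d(image, box))
same_for_sobel = np.allclose(correlate2d(image, sobel_x), convolve2d(image, sobel_x))
print(same_for_box, same_for_sobel)
```

Rotating the Sobel x kernel by 180° negates it, so for that kernel the two operations differ only in sign.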

6. What are linear and non-linear filters? Give examples.

Filters are used in image processing to modify an image, either to enhance features, remove noise or detect edges. Filters are classified into linear and non-linear based on how the output pixel is computed from its neighborhood.

Linear Filters: A linear filter computes each output pixel as a weighted sum of its neighboring pixels. These filters follow the principles of linearity and superposition, meaning the output changes proportionally to the input. Linear filters are mainly used for smoothing, sharpening and edge detection.

Formula (2D linear filter):

g(x,y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} f(x-i, y-j) \cdot h(i,j)

Where:

f(x,y) = input image
h(i,j) = filter/kernel
g(x,y) = output image

Examples:

Averaging filter: Smooths image by averaging neighbors.
Gaussian filter: Smooths image using Gaussian weights (reduces noise).
Laplacian filter: Highlights edges using second-order derivatives.

Pros: Simple, efficient, good for smoothing/sharpening.

Cons: Can blur edges and fine details.

Non-Linear Filters: A non-linear filter computes each output pixel using a non-linear function of neighboring pixels. These filters are effective for noise removal while preserving edges and details.

Examples:

Median filter: Replaces pixel with median of neighbors (removes salt-and-pepper noise).
Max/Min filter: Replaces pixel with maximum/minimum value in neighborhood.
Bilateral filter: Smooths image while keeping edges sharp (considers spatial and intensity differences).

Pros: Preserves edges, effective against impulsive noise.

Cons: Slightly more computationally expensive than linear filters.
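
To make the contrast concrete, the sketch below applies a 3×3 mean filter (linear) and a 3×3 median filter (non-linear) to a flat image containing one "salt" pixel. The helper `filter3x3` is a hypothetical name and edge replication at the borders is an assumption.

```python
import numpy as np

def filter3x3(image, reducer):
    """Apply `reducer` (e.g. np.mean or np.median) over each 3x3 neighborhood."""
    padded = np.pad(image.astype(float), 1, mode="edge")
    out = np.empty(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = reducer(padded[r:r + 3, c:c + 3])
    return out

image = np.full((5, 5), 100.0)
image[2, 2] = 255.0                      # a single "salt" pixel

mean_result = filter3x3(image, np.mean)      # spreads the outlier over its neighbors
median_result = filter3x3(image, np.median)  # discards the outlier entirely
```

The mean filter only dilutes the impulse (each affected pixel becomes about 117), while the median filter restores the flat value of 100 everywhere.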

7. Explain Gaussian filtering and its purpose.

Gaussian filtering is a type of linear smoothing filter used to reduce noise and blur an image in a controlled way. It uses a Gaussian function to assign weights to neighboring pixels, giving more importance to pixels near the center and less to those farther away. This weighted averaging preserves the general structure of the image while effectively removing high-frequency noise. Gaussian filtering is widely used in image preprocessing, edge detection and computer vision tasks because it smooths images without introducing sharp artifacts.

Gaussian function (1D) formula:

G(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{x^2}{2 \sigma^2}}

2D Gaussian function (used for images):

G(x,y) = \frac{1}{2 \pi \sigma^2} \, e^{-\frac{x^2 + y^2}{2 \sigma^2}}

Where \sigma controls the spread of the Gaussian (larger \sigma → more blurring).

Purpose of Gaussian filtering:

Reduces random noise in images.
Smooths images while preserving edges better than a simple average filter.
Acts as a preprocessing step for edge detection algorithms like Sobel or Canny.
Helps in scale-space analysis in computer vision.
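
A Gaussian kernel can be built by sampling the 2D formula above on a small grid. In this sketch (`gaussian_kernel` is a hypothetical name, and the size is assumed odd) the weights are normalized to sum to 1, so the constant factor 1/(2πσ²) cancels and is omitted.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sample G(x, y) on a size x size grid and normalize the weights to sum to 1."""
    half = size // 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    kernel = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

narrow = gaussian_kernel(5, sigma=0.8)   # weight concentrated at the center
wide = gaussian_kernel(5, sigma=2.0)     # flatter weights, stronger blurring
```

Normalizing keeps the overall image brightness unchanged after filtering. A larger σ spreads weight away from the center, which is why it blurs more.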

8. What are some commonly used image enhancement techniques?

Image enhancement involves improving the visual appearance of an image or making it easier to analyze. The goal is to highlight important features, improve contrast, reduce noise and make details more visible. Enhancement can be applied in the spatial domain (directly on pixels) or the frequency domain (using transforms like Fourier). Some commonly used enhancement techniques are:

1. Contrast Enhancement

Improves the difference between dark and bright regions.
Makes features more distinguishable in an image.
Example: Histogram equalization redistributes pixel intensities for better contrast.

2. Brightness Adjustment

Modifies the overall lightness of the image.
Achieved by adding or subtracting constant values to pixel intensities.
Helps in making dark images clearer or reducing overexposure.

3. Smoothing (Noise Reduction)

Reduces unwanted noise in an image while preserving structures.
Useful for pre-processing before edge detection or segmentation.
Examples: Mean (average) filter, Gaussian filter, Median filter.

4. Sharpening

Enhances edges and fine details in an image.
Makes features clearer and more defined.
Examples: Laplacian filter, Unsharp masking.

5. Edge Enhancement

Highlights boundaries between objects or regions in an image.
Helps in feature extraction and object detection.
Often used in combination with edge detection algorithms like Sobel or Canny.

6. Histogram Processing

Adjusts the intensity distribution of an image.
Improves visual quality by stretching or equalizing histograms.
Examples: Histogram equalization, Histogram stretching.

7. Color Enhancement

Improves color balance, saturation or hue.
Enhances visual appeal or clarifies features in colored images.
Used in photography, medical imaging and remote sensing.

8. Frequency Domain Enhancement

Applies filters in the frequency domain to improve image quality.
Can remove noise or enhance details based on frequency components.
Examples: High-pass filtering for edges, Low-pass filtering for noise reduction.

9. What is histogram equalization and how does it enhance images?

Histogram equalization is an image enhancement technique that improves the contrast of an image by redistributing its intensity values. It spreads out the most frequent intensity values across the entire range, making dark regions brighter and bright regions darker when necessary. This helps to reveal hidden details and makes features in the image more distinguishable, especially in low-contrast or underexposed images.

Works by transforming pixel intensities based on the cumulative distribution function (CDF) of the histogram.
Enhances global contrast of the image without changing its spatial information.
Particularly effective for images where the histogram is concentrated in a narrow intensity range.
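Can be applied to grayscale images directly. For color images, it is usually applied to a luminance channel (e.g., after converting to YCbCr or HSV), since equalizing each RGB channel separately can shift colors.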
Variants like Adaptive Histogram Equalization (AHE) or CLAHE work locally to avoid over-enhancement.
Improves visibility of details and features in underexposed or low-contrast regions.
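
The CDF-based mapping can be sketched as a lookup table for 8-bit grayscale images. This is a minimal global version under those assumptions, not AHE or CLAHE, and `equalize_histogram` is a hypothetical name.

```python
import numpy as np

def equalize_histogram(image):
    """Histogram-equalize an 8-bit grayscale image using its CDF."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                 # CDF value at the darkest level present
    total = image.size
    if total == cdf_min:                      # constant image: nothing to spread
        return image.copy()
    # Map each level so the output CDF is approximately linear over 0..255.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[image]

rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
equalized = equalize_histogram(low_contrast)
print(low_contrast.min(), low_contrast.max(), "->", equalized.min(), equalized.max())
```

The input occupies only levels 100 to 139, while the output is stretched across the full 0 to 255 range.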

10. Explain the concept of color correction and its applications.

Color correction is an image enhancement technique that adjusts the colors of an image to make them appear more natural, accurate or visually appealing. It is used to compensate for lighting conditions, sensor limitations or color casts caused by environmental factors. The goal is to ensure that objects in the image have the correct color representation and maintain consistent color balance across different images or scenes.

Adjusts color balance, saturation and hue to correct visual inconsistencies.
Compensates for color casts caused by lighting conditions, e.g., tungsten or fluorescent lighting.
Ensures that colors are represented accurately for human perception or machine analysis.
Can be applied globally to the whole image or locally to specific regions.
Widely used in photography, cinematography, broadcasting and image preprocessing for computer vision.
Techniques include white balance adjustment, gamma correction and color grading.
Improves visual appeal and ensures consistency across multiple images or frames.
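
Two of the techniques listed above can be sketched in a few lines. The white balance here uses the gray-world assumption (the average scene color should be neutral gray), which is one simple heuristic among many and is not named in the text above. The gamma function uses the common power-law form. Both function names are hypothetical.

```python
import numpy as np

def gray_world_balance(rgb):
    """Scale each channel so all three channel means match (gray-world assumption)."""
    rgb = rgb.astype(float)
    channel_means = rgb.reshape(-1, 3).mean(axis=0)
    gains = channel_means.mean() / channel_means
    return np.clip(np.round(rgb * gains), 0, 255).astype(np.uint8)

def gamma_correct(image, gamma):
    """Apply out = 255 * (in / 255) ** (1 / gamma); gamma > 1 brightens midtones."""
    normalized = image.astype(float) / 255.0
    return np.round(255.0 * normalized ** (1.0 / gamma)).astype(np.uint8)

# A flat gray scene photographed with a warm (reddish) color cast.
scene = np.zeros((4, 4, 3), dtype=np.uint8)
scene[:] = (180, 120, 90)

balanced = gray_world_balance(scene)          # every channel pulled to the common mean
brightened = gamma_correct(np.array([0, 64, 255], dtype=np.uint8), gamma=2.2)
```

Gray-world fails on scenes dominated by one color, which is why cameras and editing tools typically offer several white balance modes.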

11. What are the different types of noise that can occur in images?

Noise in images refers to unwanted random variations in pixel intensity which can degrade image quality and affect analysis. Noise can be introduced during image acquisition, transmission or compression. Different types of noise have distinct characteristics and require different filtering techniques for removal.

1. Gaussian Noise

Random variations in intensity following a Gaussian (normal) distribution.
Appears as grainy texture, especially in low-light images.
Common in sensor readings and electronic imaging devices.

2. Salt-and-Pepper Noise

Random occurrences of black and white pixels in an image.
Also called impulse noise.
Often caused by transmission errors or faulty sensors.

3. Speckle Noise

Multiplicative noise that appears as granular interference.
Common in radar, ultrasound and coherent imaging systems.

4. Poisson Noise

Noise whose variance is proportional to the signal intensity.
Arises from photon counting in imaging sensors.
Often seen in low-light imaging conditions.

5. Quantization Noise

Caused by rounding errors during analog-to-digital conversion.
Introduces small fluctuations in intensity values.

6. Periodic Noise

Appears as repetitive patterns or lines in an image.
Often caused by electrical interference or mechanical vibrations during acquisition.

Noise reduction methods depend on the type of noise present.
Linear filters like Gaussian blur work well for Gaussian noise.
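
For experimentation, several of these noise types can be simulated on a flat test image. The sketch below uses NumPy's random generator with a fixed seed. The noise levels (σ = 10, 5% impulses) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
clean = np.full((100, 100), 128.0)

# Additive Gaussian noise: zero mean, standard deviation of 10 intensity levels.
gaussian_noisy = clean + rng.normal(0.0, 10.0, size=clean.shape)

# Salt-and-pepper noise: about 5% of pixels forced to pure black or pure white.
salt_pepper = clean.copy()
u = rng.random(clean.shape)
salt_pepper[u < 0.025] = 0.0
salt_pepper[u > 0.975] = 255.0

# Poisson (shot) noise: each pixel is a count whose variance equals its mean.
poisson_noisy = rng.poisson(clean).astype(float)

print(gaussian_noisy.std(), np.mean(salt_pepper != 128.0), poisson_noisy.var())
```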

12. Explain different noise reduction techniques.

Noise reduction techniques are used to remove unwanted variations in pixel intensity while preserving important image details like edges and textures. Different filters are effective for different types of noise.

1. Gaussian Filter

A linear smoothing filter that reduces random noise using a weighted average of neighboring pixels.
Gives more weight to pixels near the center and less to distant pixels.
Effective for Gaussian noise but can blur edges.
Kernel size and standard deviation (\sigma) control the amount of smoothing.

2. Median Filter

A non-linear filter that replaces each pixel with the median of its neighbors.
Very effective for removing salt-and-pepper noise.
Preserves edges better than linear smoothing filters.
Can be applied with different neighborhood sizes (e.g., 3×3, 5×5).

3. Bilateral Filter

Non-linear filter that smooths images while preserving edges.
Considers both spatial proximity and intensity similarity for weighting neighboring pixels.
Reduces noise without blurring edges.
Computationally more intensive than Gaussian or median filters.

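4. Non-Local Means (NLM) Filter

A non-linear filter that replaces each pixel with a weighted average of pixels from across the image, weighting each by how similar its surrounding patch is to the patch around the target pixel.
Excellent at preserving textures and fine details.
More computationally expensive than local filters.

In summary, the Gaussian filter is simple and fast but may blur edges, while the median filter is ideal for impulsive noise like salt-and-pepper.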

13. What is Principal Component Analysis (PCA) and how is it used in image processing?

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while retaining most of the important information. In image processing, it is often used to reduce the number of features or pixels, compress images, remove redundancy and extract the most significant patterns. It works by finding the principal components, which are the directions of maximum variance in the data, and projecting the original image onto these components.

PCA identifies the directions (principal components) where the data varies the most.
Reduces the size of image data without losing significant information.
Helps in removing noise and redundant information from images.
Commonly used in face recognition to create eigenfaces.
Can be applied to both grayscale and color images by reshaping them into vectors.
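
A minimal PCA sketch using the singular value decomposition, assuming each image has been flattened into one row vector. The data here is synthetic (50 hypothetical 8×8 images generated from 3 underlying patterns), and `pca_fit` is an illustrative name.

```python
import numpy as np

def pca_fit(data, n_components):
    """Return the mean and the top principal directions of row-vector samples."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

rng = np.random.default_rng(0)
# 50 synthetic 8x8 "images" (flattened to 64 values) built from only 3 patterns.
patterns = rng.normal(size=(3, 64))
weights = rng.normal(size=(50, 3))
images = weights @ patterns + 5.0

mean, components = pca_fit(images, n_components=3)
codes = (images - mean) @ components.T           # 50 x 3 compressed representation
reconstructed = codes @ components + mean        # back to 50 x 64
```

Because this synthetic data truly has only three degrees of freedom, three components reconstruct it almost exactly. Real images would lose some detail, with the loss shrinking as more components are kept.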

14. What are Affine Transformations in images?

Affine transformations are geometric transformations that preserve points, straight lines and parallelism in an image. They are used to rotate, scale, translate, shear or reflect images while maintaining the general structure. Affine transformations are widely applied in image registration, object detection, image stitching and geometric corrections. They can be represented using matrix multiplication and vector addition, making them computationally efficient for image processing tasks.

Affine transformations preserve collinearity and ratios of distances along a line.
Common operations include translation, rotation, scaling, reflection and shearing.
The transformation of a point (x,y) can be represented as:

\begin{bmatrix} x' \\ y' \end{bmatrix} =\begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} +\begin{bmatrix} t_x \\ t_y \end{bmatrix}

Where (x', y') is the transformed point, a, b, c, d define rotation, scaling or shearing and t_x, t_y define translation.
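
The matrix form above can be applied to a batch of points in one line. This sketch rotates the corners of a unit square by 90° and then translates them. The names are illustrative, and the rotation is counter-clockwise in a y-up coordinate system (it appears clockwise in image coordinates where y points down).

```python
import numpy as np

def affine_transform(points, A, t):
    """Apply p' = A p + t to an (n, 2) array of (x, y) points."""
    return points @ A.T + t

theta = np.pi / 2                                  # 90 degree rotation
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # plays the role of [[a, b], [c, d]]
t = np.array([2.0, 0.0])                           # (t_x, t_y): shift 2 units along x

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
moved = affine_transform(square, A, t)
print(np.round(moved, 6))
```

The corners map to (2, 0), (2, 1), (1, 1) and (1, 0): still a unit square, with opposite sides still parallel.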

15. What are geometric transformations in image processing?

Geometric transformations are operations that change the spatial arrangement of pixels in an image. These transformations are used to resize, rotate, translate, warp or map images to a different coordinate system. They are essential for tasks like image registration, object alignment, perspective correction and image stitching.

Transformations alter pixel positions while possibly keeping intensity values unchanged.
Common types include translation, rotation, scaling, reflection, shearing and affine transformations.
Non-linear transformations include perspective (projective) and warping transformations.
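Affine transformations can be written as a single matrix product by using homogeneous coordinates:

\begin{bmatrix} x' \\ y' \end{bmatrix} =\mathbf{T} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}

Where \mathbf{T} is the 2 \times 3 transformation matrix and (x,y), (x',y') are the original and transformed coordinates. Perspective transformations use a 3 \times 3 matrix followed by division by the third coordinate.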

Geometric transformations are used in image registration, rectification and correcting distortions.
Essential in computer vision applications like object tracking, augmented reality and robotic vision.
Can be applied to both grayscale and color images.

16. What are morphological operations and why are they useful?

Morphological operations are image processing techniques that focus on the shape and structure of objects within an image. They analyze and process images using a small shape called a structuring element to probe and transform the objects. These operations are widely used in binary and grayscale images for tasks like noise removal, object segmentation and shape analysis.

Uses of Morphological Operations:

Removes small noise from binary images using opening operations.
Fills small holes or gaps in objects using closing operations.
Highlights object boundaries using the morphological gradient.
Shrinks or enlarges objects with erosion and dilation, respectively.
Helps in feature extraction and preparing images for segmentation.
Useful for preprocessing images in computer vision and pattern recognition tasks.

17. What is the morphological gradient?

The morphological gradient is a morphological operation that highlights the edges or boundaries of objects in an image. It is computed as the difference between the dilation and erosion of an image using a structuring element. This operation emphasizes the transition regions between foreground and background, making it useful for edge detection and shape analysis in both binary and grayscale images.

Uses of Morphological Gradient:

Highlights object boundaries clearly in images.
Useful for edge detection in pre-processing steps.
Helps in segmentation by identifying contours of objects.
Enhances structural details without significantly altering object shapes.
Can be combined with other morphological operations for feature extraction.

Formula:

Gradient = Dilation(f) - Erosion(f)

Where f is the input image.
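
The formula can be sketched with a 3×3 square structuring element, where dilation is a neighborhood maximum and erosion is a neighborhood minimum. The helper names are hypothetical and edge replication at the border is an assumption.

```python
import numpy as np

def _neighborhoods(image):
    """Stack the nine 3x3-shifted copies of an image (edge-replicated border)."""
    padded = np.pad(image, 1, mode="edge")
    rows, cols = image.shape
    return np.stack([padded[dr:dr + rows, dc:dc + cols]
                     for dr in range(3) for dc in range(3)])

def dilate(image):
    """Dilation with a 3x3 square: maximum over each neighborhood."""
    return _neighborhoods(image).max(axis=0)

def erode(image):
    """Erosion with a 3x3 square: minimum over each neighborhood."""
    return _neighborhoods(image).min(axis=0)

image = np.zeros((7, 7), dtype=np.uint8)
image[2:5, 2:5] = 1                       # a 3x3 white square

gradient = dilate(image) - erode(image)   # Gradient = Dilation(f) - Erosion(f)
print(gradient)
```

Dilation grows the square to 5×5 and erosion shrinks it to a single pixel, so the gradient is a ring around the object boundary with a hole at the center.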

18. What is an edge in an image?

An edge in an image is a boundary or transition between regions with significant changes in intensity or color. Edges correspond to object boundaries, surface markings or texture changes. Detecting edges is a fundamental step in image processing because it helps identify important structures and shapes within the image.

Appears where intensity changes sharply.
Represents boundaries between objects or regions.
Can be detected using gradient-based or morphological techniques.

19. Explain Sobel and Prewitt edge detectors.

1. Sobel operator: It is a gradient-based edge detection method that detects edges in both horizontal and vertical directions. It uses two 3×3 convolution kernels to compute approximate derivatives along x and y axes. The final edge strength is obtained by combining these gradients.

Horizontal kernel detects vertical edges; vertical kernel detects horizontal edges.
Gradient magnitude:

G = \sqrt{G_x^2 + G_y^2}

Where G_x and G_y are convolutions of the image with the Sobel kernels.

2. Prewitt operator: The Prewitt operator is another gradient-based edge detection method similar to Sobel, but its kernels use uniform weights instead of giving extra weight to the center row or column. It also uses two 3×3 kernels to detect horizontal and vertical edges.

Gradient magnitude:

G = \sqrt{G_x^2 + G_y^2}

Less robust to noise compared to Sobel.
Faster and simpler to compute.
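
The two operators can be compared on a synthetic vertical step edge. The kernels below are the standard Sobel and Prewitt masks. The helper names are hypothetical, and the sketch slides the kernels without flipping them (correlation), which changes only the sign of G_x and G_y and not the magnitude.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T
PREWITT_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
PREWITT_Y = PREWITT_X.T

def filter_valid(image, kernel):
    """Slide a 3x3 kernel over the image (no padding, no kernel flip)."""
    rows, cols = image.shape
    out = np.zeros((rows - 2, cols - 2))
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r, c] = np.sum(image[r:r + 3, c:c + 3] * kernel)
    return out

def gradient_magnitude(image, kernel_x, kernel_y):
    gx = filter_valid(image, kernel_x)
    gy = filter_valid(image, kernel_y)
    return np.sqrt(gx**2 + gy**2)

# Vertical step edge: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

sobel = gradient_magnitude(image, SOBEL_X, SOBEL_Y)
prewitt = gradient_magnitude(image, PREWITT_X, PREWITT_Y)
print(sobel[0], prewitt[0])
```

Both respond only around the edge columns. Sobel's response is 4 and Prewitt's is 3 there, reflecting Sobel's heavier center weighting.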

20. Explain the Canny edge detection algorithm step by step.

The Canny edge detector is a multi-step gradient-based edge detection method designed to detect edges in images accurately and with minimal noise. It combines smoothing, gradient calculation, non-maximum suppression and edge tracking to produce clean edge maps.

Steps in the Canny Algorithm:

Step 1: Noise Reduction

Smooth the image using a Gaussian filter to reduce noise.
Kernel size and standard deviation (\sigma) control smoothing.

I_s = I * G_\sigma

Where I is the input image and G_\sigma is the Gaussian kernel.

Step 2: Gradient Calculation

Compute gradients in the x and y directions using Sobel operators.
Calculate gradient magnitude and direction:

G = \sqrt{G_x^2 + G_y^2}, \quad \theta = \operatorname{atan2}(G_y, G_x)

Step 3: Non-Maximum Suppression

Thins the edges by keeping only local maxima in the gradient direction.
Suppresses pixels that are not on the edge ridge.

Step 4: Double Thresholding

Apply two thresholds: high and low.
Classify edges as strong, weak or non-edges based on these thresholds.

Step 5: Edge Tracking by Hysteresis

Connect weak edges to strong edges if they are connected, otherwise discard them.
Produces the final clean edge map.

Combines noise reduction, edge detection and thresholding for accurate edge extraction.
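
Steps 4 and 5 can be sketched on a precomputed gradient magnitude map. This is a simplified illustration, not a full Canny implementation: it skips smoothing and non-maximum suppression, assumes 8-connectivity, and uses the hypothetical name `hysteresis_threshold`.

```python
import numpy as np

def hysteresis_threshold(magnitude, low, high):
    """Keep strong pixels plus any weak pixels 8-connected to a strong one."""
    strong = magnitude >= high
    weak = (magnitude >= low) & ~strong
    edges = strong.copy()
    rows, cols = edges.shape
    while True:
        # Grow the current edge set by one pixel in all 8 directions.
        padded = np.pad(edges, 1)
        grown = np.zeros_like(edges)
        for dr in range(3):
            for dc in range(3):
                grown |= padded[dr:dr + rows, dc:dc + cols]
        # Accept weak pixels that touch the grown edge set.
        updated = edges | (grown & weak)
        if np.array_equal(updated, edges):
            return edges
        edges = updated

magnitude = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.5, 0.4, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])
edges = hysteresis_threshold(magnitude, low=0.3, high=0.8)
print(edges.astype(int))
```

The weak pixels at 0.5 and 0.4 survive because they chain back to the strong pixel at 0.9, while the isolated weak pixel on the right is discarded.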

21. What is a feature descriptor in computer vision?

A feature descriptor is a representation of an image region or keypoint that captures distinctive information about its appearance, shape or texture. Feature descriptors are used to describe and match keypoints across images, enabling tasks like object recognition, image matching and tracking. They transform raw pixel information into a compact and robust vector that can be compared across images even under changes in scale, rotation or illumination.

Encodes information about local image patterns, such as edges, corners or textures.
Examples of feature descriptors: SIFT, SURF, ORB, BRIEF, HOG.
Usually represented as a vector of numbers that summarizes the local region around a keypoint.
Enables feature matching between images for applications like panorama stitching or 3D reconstruction.
Can be invariant to scale, rotation and illumination depending on the descriptor type.
Often used in combination with feature detectors (e.g., Harris corner, FAST) to find and describe keypoints.

22. What is Scale-Invariant Feature Transform (SIFT) and how does it work?

Scale-Invariant Feature Transform (SIFT) is a feature detection and description method used in computer vision to identify and describe distinctive keypoints in images. It is invariant to scale, rotation and partially invariant to illumination changes, making it ideal for matching objects across images taken from different viewpoints or under different conditions. SIFT detects keypoints and computes robust feature descriptors for each keypoint that can be used for image matching, object recognition and 3D reconstruction.

Step 1: Scale-space Extrema Detection: Identify potential keypoints by searching for local maxima and minima in the Difference of Gaussian (DoG) images at multiple scales.

Step 2: Keypoint Localization: Refine keypoints by discarding unstable candidates, namely those with low contrast and those lying along edges, where the position is poorly defined.

Step 3: Orientation Assignment: Assign a dominant orientation to each keypoint based on local gradient directions, making descriptors rotation-invariant.

Step 4: Keypoint Descriptor Generation:

Compute a 128-dimensional vector based on the gradient magnitudes and orientations around the keypoint.
The descriptor captures local appearance patterns robustly.

23. Explain Speeded Up Robust Features (SURF).

Speeded Up Robust Features (SURF) is a fast and robust feature detection and description algorithm in computer vision. It is designed as a computationally efficient alternative to SIFT, providing scale- and rotation-invariant keypoints and descriptors for tasks like image matching, object recognition and tracking.

Detects keypoints using a Hessian matrix-based detector and assigns a dominant orientation for rotation invariance.
Uses integral images to speed up computation and creates 64- or 128-dimensional descriptors based on local Haar wavelet responses.
Faster than SIFT while remaining robust to scale, rotation and illumination changes.

24. What is ORB and how does it compare to SIFT and SURF?

ORB (Oriented FAST and Rotated BRIEF) is a fast and efficient feature detection and description algorithm used in computer vision. It combines the FAST keypoint detector with the BRIEF descriptor, adding orientation information to achieve rotation invariance. ORB is designed to be computationally lightweight while maintaining robustness, making it ideal for real-time applications.

Feature	SIFT	SURF	ORB
Speed	Slow	Faster than SIFT	Fastest
Keypoint Detection	Difference of Gaussian (DoG)	Hessian matrix	FAST
Descriptor Type	128-dimensional floating-point	64- or 128-dimensional floating-point	Binary BRIEF
Rotation Invariance	Yes	Yes	Yes
Scale Invariance	Yes	Yes	Partially
Illumination Robustness	High	High	Moderate
Computational Cost	High	Moderate	Low
Applications	High-accuracy matching, object recognition	Image stitching, 3D reconstruction	Real-time tracking, SLAM, mobile applications

25. What is the Histogram of Oriented Gradients (HOG) and how is it used?

Histogram of Oriented Gradients (HOG) is a feature descriptor used in computer vision to represent the local shape and appearance of objects in an image. It works by dividing the image into small cells, computing the gradient orientation in each cell and forming histograms of these orientations. HOG captures the edge and gradient structure of an object, making it effective for object detection, especially for detecting humans, vehicles and other rigid objects.

Divides the image into small spatial regions (cells).
Computes gradients (magnitude and direction) at each pixel.
Forms histograms of gradient orientations for each cell.
Groups cells into blocks for normalization to improve robustness to illumination.
Produces a feature vector representing local shapes and textures.
Widely used in pedestrian detection, object recognition and image classification.
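
The per-cell step can be sketched as a magnitude-weighted orientation histogram. This is a simplified version under stated assumptions: 9 unsigned bins over 0 to 180 degrees, hard bin assignment (full HOG implementations usually interpolate votes between bins), and hypothetical function names.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    gy, gx = np.gradient(cell.astype(float))       # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0.0, 180.0),
                           weights=magnitude)
    return hist

def l2_normalize(block_vector, eps=1e-6):
    """Block normalization: divide by the L2 norm to reduce illumination effects."""
    return block_vector / np.sqrt(np.sum(block_vector**2) + eps**2)

# An 8x8 cell containing a horizontal intensity ramp (gradient points along x).
cell = np.tile(np.arange(8, dtype=float), (8, 1))
hist = cell_histogram(cell)

# Doubling the contrast doubles the histogram, but not the normalized vector.
same_after_normalizing = np.allclose(l2_normalize(hist), l2_normalize(2.0 * hist))
```

All 64 pixels have a unit gradient pointing along x, so every vote lands in the first orientation bin.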

26. Explain template matching and its limitations.

Template matching is a technique in computer vision used to find parts of an image that match a given template or reference pattern. It works by sliding the template over the input image and computing a similarity measure (e.g., cross-correlation) at each position. The location with the highest similarity indicates the best match. Template matching is simple and effective for detecting objects when their size, orientation and appearance are known and consistent.

Computes similarity metrics such as sum of squared differences (SSD), cross-correlation or normalized correlation between template and image regions.
Can be applied in grayscale or color images.
Works best for rigid objects with little variation in scale, rotation or lighting.
Can detect single or multiple occurrences of the template in an image.

Limitations:

Sensitive to scale changes: the template size must match the object size.
Sensitive to rotation and orientation changes.
Sensitive to illumination and contrast variations.
Computationally expensive for large images or multiple templates.
Cannot handle non-rigid or highly deformable objects effectively.
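
A brute-force sketch of template matching using the sum of squared differences (SSD). Since SSD measures dissimilarity, the best match is the position with the lowest value rather than the highest. The function name is hypothetical and the test image is random data.

```python
import numpy as np

def match_template_ssd(image, template):
    """Return the SSD map and the (row, col) of the best (lowest-SSD) match."""
    ih, iw = image.shape
    th, tw = template.shape
    ssd = np.empty((ih - th + 1, iw - tw + 1))
    for r in range(ssd.shape[0]):
        for c in range(ssd.shape[1]):
            diff = image[r:r + th, c:c + tw] - template
            ssd[r, c] = np.sum(diff**2)
    best = np.unravel_index(np.argmin(ssd), ssd.shape)
    return ssd, best

rng = np.random.default_rng(3)
image = rng.random((20, 20))
template = image[5:9, 12:16].copy()       # cut the template out of the image itself
ssd, best = match_template_ssd(image, template)
print(best)
```

The nested loops visit every position, which illustrates why the method becomes expensive for large images or many templates.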

27. What is optical flow? Explain Lucas-Kanade method.

Optical flow is a technique in computer vision that estimates the motion of objects, surfaces or edges between consecutive frames in a video. It represents the apparent motion of pixels as a vector field, showing the direction and magnitude of movement. Optical flow is widely used in motion detection, video analysis, object tracking and robotics.

Represents motion as a velocity vector for each pixel: \mathbf{v} = (u, v) where u and v are horizontal and vertical displacements.
Assumes brightness constancy, i.e., pixel intensity remains constant between frames.
Can be dense (every pixel) or sparse (selected keypoints).

Lucas-Kanade Method: The Lucas-Kanade method is a sparse optical flow algorithm that estimates motion for a set of keypoints by assuming small motion and constant velocity within a local neighborhood. It solves a set of linear equations for each keypoint to compute the displacement vectors.

Uses a small window around each keypoint to approximate motion.
Solves the optical flow equation using least squares minimization:

I_x u + I_y v = -I_t

Where I_x, I_y are spatial derivatives, I_t is temporal derivative and (u,v) is the flow vector.

Works well for small, local motions and is computationally efficient.
Often combined with pyramidal implementation to handle larger motions.
Used in object tracking, motion estimation and video stabilization.
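
The least-squares step can be sketched for a single window. This is a single-scale illustration on a synthetic smooth texture shifted by a known sub-pixel amount, with no pyramid. The function names, window size and test pattern are all assumptions made for the demo.

```python
import numpy as np

def lucas_kanade_window(frame1, frame2, row, col, half=7):
    """Estimate the flow (u, v) for the window centered at (row, col)."""
    Iy, Ix = np.gradient(frame1)                 # spatial derivatives (rows, columns)
    It = frame2 - frame1                         # temporal derivative
    win = (slice(row - half, row + half + 1), slice(col - half, col + half + 1))
    # One equation I_x*u + I_y*v = -I_t per pixel in the window.
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                                  # array([u, v])

def pattern(x, y):
    """Smooth synthetic texture (an assumption made for this demo)."""
    return np.sin(0.3 * x) + np.cos(0.25 * y) + 0.5 * np.sin(0.2 * (x + y))

yy, xx = np.mgrid[0:40, 0:40].astype(float)
true_u, true_v = 0.3, -0.2
frame1 = pattern(xx, yy)
frame2 = pattern(xx - true_u, yy - true_v)       # content shifted by (u, v)

u, v = lucas_kanade_window(frame1, frame2, row=20, col=20)
print(u, v)
```

The estimate should land close to the true shift of (0.3, -0.2). It degrades for large displacements because the linearization behind the optical flow equation only holds for small motion.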

28. How does a Convolutional Neural Network (CNN) work for image classification?

A Convolutional Neural Network (CNN) is a deep learning model that automatically learns important features from images for classification. It works by processing the image through multiple layers that detect patterns at different levels, from simple edges in early layers to complex shapes in deeper layers and finally outputs probabilities for each class.

Convolutional Layers: Apply filters to detect local features like edges or textures.
Activation Functions: Add non-linearity, e.g., ReLU.
Pooling Layers: Reduce spatial dimensions and computation while retaining important features.
Fully Connected + Softmax Layers: Flatten feature maps and output class probabilities.
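
The layer sequence above can be traced with a tiny forward pass in NumPy. This sketch uses random, untrained weights and a hypothetical 8×8 input, so the output probabilities are meaningless. It only shows how data flows and how shapes change.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation, as CNN layers typically compute it."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    rows, cols = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:rows, :cols].reshape(rows // 2, 2, cols // 2, 2).max(axis=(1, 3))

def softmax(logits):
    shifted = np.exp(logits - logits.max())     # subtract max for numerical stability
    return shifted / shifted.sum()

rng = np.random.default_rng(0)
image = rng.random((8, 8))                       # hypothetical 8x8 grayscale input
kernels = rng.normal(size=(4, 3, 3))             # 4 filters (random, i.e. untrained)
fc_weights = rng.normal(size=(3, 4 * 3 * 3))     # 3 output classes
fc_bias = np.zeros(3)

# conv (8x8 -> 6x6) -> ReLU -> max pool (6x6 -> 3x3), once per filter
feature_maps = [max_pool2x2(relu(conv2d_valid(image, k))) for k in kernels]
flat = np.concatenate([m.ravel() for m in feature_maps])   # 4 * 9 = 36 features
probabilities = softmax(fc_weights @ flat + fc_bias)
print(probabilities)
```

In a real network the kernel and fully connected weights are learned by backpropagation, and several convolution and pooling stages are stacked before the classifier.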

29. What are convolutional layers, pooling layers and fully connected layers?

Convolutional, pooling and fully connected layers are the three main building blocks of a CNN. Convolutional layers extract local features, pooling layers downsample the resulting feature maps and fully connected layers combine the features into a prediction.

Convolutional Layers: Convolutional layers are the core feature extractors in a CNN. They apply multiple filters (kernels) across the input image or feature maps to detect local patterns such as edges, textures and shapes. Each filter produces a feature map that highlights where a particular pattern is present, allowing the network to learn important visual characteristics automatically.
Pooling Layers: Pooling layers are used to reduce the spatial dimensions of feature maps while retaining important information. Common pooling operations include max pooling which selects the maximum value in a region and average pooling which computes the average. Pooling helps reduce computational cost, control overfitting and provides a degree of spatial invariance, making the network more robust to translations in the input image.
Fully Connected (FC) Layers: Fully connected layers come after the convolutional and pooling layers and are responsible for high-level reasoning. They take the flattened feature maps from previous layers and connect every input value to every output neuron, allowing the network to combine learned features to make predictions. Typically, the final fully connected layer is followed by a softmax activation to produce class probabilities for image classification tasks.

30. What is the purpose of pooling layers in CNNs?

Pooling layers are used in Convolutional Neural Networks (CNNs) to reduce the spatial dimensions of feature maps while retaining the most important information. By summarizing regions of the input, pooling layers help the network focus on dominant features rather than precise pixel locations. This makes the network more computationally efficient, robust to small translations and less prone to overfitting.

Dimensionality Reduction: Reduces the size of feature maps, lowering computation and memory usage.
Translation Invariance: Makes the network less sensitive to small shifts or distortions in the input.
Feature Emphasis: Highlights the most significant activations (e.g., using max pooling).
Overfitting Control: Simplifies representations which can help prevent overfitting.

31. What is the difference between max pooling and average pooling?

Feature	Max Pooling	Average Pooling
Operation	Selects the maximum value in the pooling window	Computes the average value in the pooling window
Feature Emphasis	Highlights strongest activations	Provides a smoothed representation
Preservation of Details	Preserves prominent edges and textures	Can dilute strong features
Common Usage	Widely used between convolutional blocks for downsampling	Common as global average pooling before the classifier (e.g., in ResNet)
Effect on Noise	Can ignore weak noisy activations	May be influenced by noise
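
A small worked example makes the difference concrete. The helper pool2d and the 4×4 feature map are illustrative only; the window is 2×2 with stride 2.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window."""
    h, w = x.shape
    blocks = x[:h - h % size, :w - w % size].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [5, 7, 1, 1],
                        [0, 2, 9, 4],
                        [2, 0, 3, 8]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")  # [[7, 2], [2, 9]]
avg_pooled = pool2d(feature_map, mode="avg")  # [[4, 1], [1, 6]]
```

Max pooling keeps the strongest activation in each window (7 and 9), while average pooling blends it with its neighbours (4 and 6).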

32. What is dropout in CNNs and why is it used?

Dropout is a regularization technique in Convolutional Neural Networks (CNNs) that helps prevent overfitting by randomly deactivating a fraction of neurons during training. By temporarily “dropping out” neurons, the network is forced to learn redundant and robust feature representations, reducing dependency on any single neuron and improving generalization to new data.

Randomly sets a percentage of neurons’ outputs to zero during training iterations.
Helps prevent overfitting by reducing co-adaptation between neurons.
Encourages the network to learn robust and distributed features.
Usually applied to fully connected layers, but can also be applied to convolutional layers.
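During testing, all neurons are active. To keep the expected activations consistent, outputs are either scaled by the keep probability at test time or, in the more common inverted dropout, scaled up by 1 / (1 - rate) during training so that no change is needed at test time.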
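
A minimal sketch of inverted dropout, the variant in which the scaling is applied during training. The function name and arguments are chosen for illustration.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return x  # at test time every neuron is active and nothing is rescaled
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob  # rescaling keeps the expected activation unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

During training roughly half of the values become 0 and the rest become 2, so the mean stays close to 1, matching the unchanged test-time output.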

33. What are some famous CNN architectures?

Over the years, several CNN architectures have become milestones in deep learning, each introducing innovations that advanced image classification, feature extraction and efficiency.

LeNet (1990s): One of the first CNNs, designed for handwritten digit recognition (MNIST). Simple architecture with convolution, pooling and fully connected layers.
AlexNet (2012): Popularized deep CNNs for ImageNet classification. Combined ReLU activations, dropout and data augmentation to speed up training and reduce overfitting.
VGG (2014): Uses very deep networks with uniform 3×3 convolution filters. Emphasizes depth and simplicity for improved feature learning.
GoogLeNet / Inception (2014): Introduced Inception modules, combining multiple filter sizes in parallel to capture multi-scale features efficiently.
ResNet (2015): Introduced residual connections (skip connections) to train very deep networks without vanishing gradients. Variants include ResNet-50, ResNet-101, ResNet-152.
DenseNet (2017): Connects every layer to all subsequent layers, improving feature reuse and gradient flow.
MobileNet (2017): Optimized for mobile and embedded devices, uses depthwise separable convolutions to reduce computation.
EfficientNet (2019): Scales depth, width and resolution uniformly using compound scaling, achieving high accuracy with fewer parameters.
Xception (2017): Extends Inception by using depthwise separable convolutions, improving computational efficiency.

34. Explain the concept of transfer learning in CNNs.

Transfer learning is a technique in Convolutional Neural Networks (CNNs) where a pre-trained model, trained on a large dataset, is reused for a different but related task. Instead of training a CNN from scratch which requires large datasets and high computation, transfer learning uses the features learned by existing models (like edges, textures and object parts) and adapts them to the new task. This approach significantly reduces training time and improves performance, especially when the new dataset is small.

Uses pre-trained models such as VGG, ResNet or Inception.
Can freeze early layers (feature extractors) and retrain later layers for the new task.
Helps in small dataset scenarios where training a CNN from scratch is impractical.
Commonly applied in image classification, object detection and medical imaging.
Allows rapid development of high-accuracy models without large computational resources.

35. How does data augmentation help improve CNN performance?

Data augmentation is a technique used in Convolutional Neural Networks (CNNs) to artificially increase the size and diversity of the training dataset by applying various transformations to the existing images. By exposing the network to modified versions of the same data, it learns to generalize better, reducing overfitting and improving performance on unseen data.

Common transformations include rotation, flipping, scaling, cropping, translation and brightness adjustments.
Helps the network become invariant to orientation, position and scale of objects.
Reduces overfitting by preventing the model from memorizing the training data.
Simulates real-world variations, improving robustness.
Simple to implement using libraries like TensorFlow, Keras and PyTorch.
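
A few of these transformations can be sketched in plain NumPy. The helper augment, the padding of 2 pixels and the brightness range are assumptions chosen for illustration; in practice the augmentation utilities of the libraries mentioned above would normally be used.

```python
import numpy as np

def augment(image, rng):
    """Random horizontal flip, random shifted crop and brightness jitter."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    pad = 2
    padded = np.pad(out, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)   # crop offset acts as a small translation
    left = rng.integers(0, 2 * pad + 1)
    h, w = image.shape[:2]
    out = padded[top:top + h, left:left + w]
    factor = rng.uniform(0.8, 1.2)       # brightness adjustment
    return np.clip(out * factor, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real RGB image scaled to [0, 1]
batch = np.stack([augment(image, rng) for _ in range(4)])
```

Each call produces a different variant of the same image, so the network rarely sees exactly the same input twice.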

36. How do YOLO and SSD object detection models work?

YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) are real-time object detection models that predict object locations and class probabilities in a single forward pass of a CNN, making them fast and efficient for practical applications.

YOLO (You Only Look Once): YOLO divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell. The network treats detection as a single regression problem, allowing it to simultaneously detect multiple objects while being extremely fast. Variants like YOLOv3, YOLOv4 and YOLOv8 have improved accuracy and speed. YOLO is widely used in real-time applications.
SSD (Single Shot MultiBox Detector): SSD detects objects by applying convolutional filters to multiple feature maps of different scales, allowing it to detect objects of varying sizes. It predicts bounding box offsets and class scores for a set of default (anchor) boxes at each feature-map location. SSD balances speed and accuracy; its multi-scale feature maps helped it outperform the original YOLO, although later YOLO versions also predict at multiple scales. It is used in autonomous driving, video surveillance and robotics.
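
To make the grid idea concrete, the sketch below finds which grid cell is responsible for an object in an original-YOLO-style layout, where the cell containing the box center predicts that object. The function name, the 448×448 image and the 7×7 grid are assumptions for the example.

```python
def responsible_cell(box, image_size, grid_size):
    """Return (row, col, x_offset, y_offset) for the cell containing the box center."""
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    cx = (x_min + x_max) / 2 / img_w * grid_size  # center in grid units
    cy = (y_min + y_max) / 2 / img_h * grid_size
    col = min(int(cx), grid_size - 1)
    row = min(int(cy), grid_size - 1)
    return row, col, cx - col, cy - row  # offsets are relative to the cell

row, col, x_off, y_off = responsible_cell(
    box=(100, 200, 300, 400), image_size=(448, 448), grid_size=7
)
```

Here the box center (200, 300) falls in row 4, column 3 of the grid, and the network would regress the offsets inside that cell together with the box width and height.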

37. Explain Region Proposal Networks (RPN) in Faster R-CNN.

Region Proposal Networks (RPN) are a key component of Faster R-CNN, designed to generate candidate object regions (proposals) efficiently for detection. Instead of using external methods like selective search, RPN shares convolutional features with the detection network, allowing the model to propose regions and classify objects in a single unified framework.

Slides a small network over the convolutional feature map of the input image.
At each spatial location, predicts objectness scores (likelihood of an object) and bounding box coordinates.
Uses multiple predefined anchor boxes of different scales and aspect ratios.
Proposals are fed to the Fast R-CNN detection head for classification and bounding box refinement.
Eliminates the need for separate, computationally expensive region proposal methods.
Improves speed and accuracy compared to earlier two-stage detectors.
Widely used in object detection tasks where both speed and precision are important.
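
Anchor generation can be sketched as follows. The scales (128, 256, 512) and aspect ratios (1:2, 1:1, 2:1) follow commonly cited defaults; the stride, feature-map size and function name are assumptions for illustration.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2), one set per feature-map location."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # center in image pixels
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    w = scale / np.sqrt(ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
```

Each location gets 3 × 3 = 9 anchors, and changing the ratio alters the shape while keeping the area at scale². The RPN then predicts an objectness score and box offsets for every anchor.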

38. What is Mask R-CNN and how does it extend Faster R-CNN?

Mask R-CNN is an extension of Faster R-CNN that adds instance segmentation capabilities to object detection. While Faster R-CNN predicts bounding boxes and class labels for objects, Mask R-CNN also predicts a pixel-level mask for each detected object, enabling the network to distinguish individual object instances, even when they overlap.

The key extension of Mask R-CNN is the addition of a mask prediction branch parallel to the existing classification and bounding box branches. To ensure precise mask alignment, it replaces Faster R-CNN’s RoIPool with RoIAlign which avoids quantization errors when mapping regions of interest (RoIs) from the feature map. This allows the network to generate accurate, pixel-level masks for each object.

Adds a parallel branch for predicting object masks.
Uses RoIAlign to maintain exact spatial correspondence between RoIs and feature maps.
Outputs class label, bounding box and segmentation mask for each detected object.
Can handle object detection and instance segmentation in a single framework.
Widely used in autonomous driving, medical imaging and video analysis.

By combining detection and segmentation, Mask R-CNN provides instance-level understanding of images, distinguishing overlapping objects with high precision.
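
The quantization issue can be illustrated with a single sampling point. This is a simplified sketch: real RoIAlign samples several points per output bin and pools them, whereas the code below only contrasts bilinear interpolation with snapping to the nearest lower cell.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Interpolate the feature map at a fractional (y, x) location."""
    h, w = feature_map.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom

feature_map = np.arange(16, dtype=float).reshape(4, 4)  # value at (i, j) is 4*i + j
y, x = 1.5, 2.25  # fractional RoI coordinate on the feature map

aligned = bilinear_sample(feature_map, y, x)                   # RoIAlign-style: 8.25
quantized = feature_map[int(np.floor(y)), int(np.floor(x))]    # RoIPool-style: 6.0
```

The quantized lookup discards the fractional part of the coordinate, which shifts the sampled region; for pixel-level masks that misalignment is noticeable.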

39. What is the difference between semantic segmentation, instance segmentation and panoptic segmentation?

Feature	Semantic Segmentation	Instance Segmentation	Panoptic Segmentation
Definition	Classifies each pixel into a category.	Detects each object and predicts a separate pixel mask per instance; background regions are usually left unlabeled.	Combines semantic + instance segmentation for all pixels.
Object Differentiation	Cannot distinguish different instances of the same class.	Distinguishes individual object instances.	Distinguishes instances and labels all pixels.
Output	Pixel-level class labels.	Per-object masks with class labels + instance IDs.	Pixel-level class labels for every pixel + instance IDs for countable objects.
Use Cases	Road segmentation, medical imaging, satellite imagery.	Detecting multiple people, vehicles or objects separately.	Autonomous driving, scene understanding, complex visual reasoning.
Complexity	Lower than instance and panoptic segmentation.	Higher than semantic segmentation due to instance IDs.	Highest complexity, combines both semantic and instance segmentation.

40. How can image segmentation be performed using K-Means clustering?

K-Means clustering can segment an image by grouping pixels with similar features (like color or intensity) into clusters. Each cluster corresponds to a segment in the image. Here are the proper steps to perform image segmentation using K-Means:

1. Feature Representation:

Represent each pixel as a feature vector.
Commonly, use RGB values or grayscale intensity, optionally including spatial coordinates to consider location.

2. Initialize Clusters:

Choose the number of clusters, K, representing the desired segments.
Randomly initialize K centroids in the feature space.

3. Assign Pixels to Clusters:

For each pixel, compute the distance (e.g., Euclidean) to each centroid.
Assign the pixel to the nearest centroid, forming K clusters.

4. Update Centroids:

Compute the new centroid of each cluster as the mean of all assigned pixels.

5. Iterate Until Convergence:

Repeat the assignment and update steps until centroids do not change significantly or a maximum number of iterations is reached.

6. Generate Segmented Image:

Replace each pixel’s value with the centroid value of its cluster or assign a unique color to each cluster.

7. Post-processing:

Apply smoothing or morphological operations to refine segment boundaries.
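
Steps 1 to 6 can be sketched in NumPy as below. This is a minimal illustration: the function name is hypothetical, initial centroids are drawn from distinct pixel values (a small guard so two clusters do not start on the same point, which assumes the image has at least K distinct colors) and post-processing is omitted.

```python
import numpy as np

def kmeans_segment(image, k, n_iters=20, seed=0):
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)               # 1. feature vectors
    rng = np.random.default_rng(seed)
    candidates = np.unique(pixels, axis=0)                    # distinct colors
    centroids = candidates[rng.choice(len(candidates), size=k, replace=False)]  # 2. init
    for _ in range(n_iters):
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # 3. nearest centroid
        new_centroids = np.array([
            pixels[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])                                                    # 4. update centroids
        if np.allclose(new_centroids, centroids):             # 5. convergence check
            break
        centroids = new_centroids
    segmented = centroids[labels].reshape(h, w, c)            # 6. recolor by centroid
    return labels.reshape(h, w), segmented

# Toy image: left half dark, right half bright
image = np.zeros((4, 6, 3))
image[:, 3:] = 1.0
labels, segmented = kmeans_segment(image, k=2)
```

On this toy image the two halves end up in different clusters. On real images the result depends on K and on the initialization, so several runs are often compared.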

41. What are Fully Convolutional Networks (FCNs) for segmentation?

Fully Convolutional Networks (FCNs) are a type of Convolutional Neural Network (CNN) designed specifically for image segmentation. Unlike traditional CNNs used for classification, FCNs replace fully connected layers with convolutional layers, allowing the network to output pixel-level predictions for the entire image. This enables the model to produce segmentation maps where each pixel is assigned a class label.

FCNs take an input image of arbitrary size and produce an output of the same spatial dimensions.
Use encoder-decoder architecture: the encoder extracts features and the decoder upsamples feature maps to the original resolution.
Employ skip connections to combine low-level spatial information with high-level semantic features for precise segmentation.
Output is a pixel-wise class probability map which can be converted to a segmented image by selecting the most probable class per pixel.
Widely used in semantic segmentation tasks such as road segmentation, medical imaging and object segmentation.

FCNs provide an end-to-end trainable framework for segmentation, enabling efficient and accurate pixel-level predictions without the need for manual feature engineering.
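
The last step, turning per-pixel class scores into a segmentation map, can be sketched as follows. The random logits stand in for the output of a trained network and the shape (3 classes, 4×5 pixels) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5))  # (classes, height, width), placeholder scores

# Softmax over the class axis gives a probability map per pixel
exp = np.exp(logits - logits.max(axis=0, keepdims=True))
probs = exp / exp.sum(axis=0, keepdims=True)

# The segmentation map keeps the most probable class at each pixel
segmentation = probs.argmax(axis=0)
```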

42. How would you train a CNN on a small dataset?

Training a Convolutional Neural Network (CNN) on a small dataset can be challenging due to the risk of overfitting and insufficient data to learn robust features. To overcome this, several strategies can be applied to improve generalization and performance.

Data Augmentation: Artificially increase dataset size by applying rotations, flips, scaling, translations, brightness adjustments, etc., to create diverse training samples.
Transfer Learning: Use a pre-trained CNN (like VGG, ResNet or Inception) as a feature extractor and fine-tune its later layers for the small dataset.
Regularization Techniques: Apply dropout, weight decay or early stopping to reduce overfitting.
Simplify the Network: Use a shallower architecture with fewer parameters to match the dataset size.
Cross-Validation: Use k-fold cross-validation to better estimate performance and reduce variance.
Batch Normalization: Helps stabilize learning and allows higher learning rates, improving convergence.
Learning Rate Scheduling: Adjust learning rates dynamically to avoid overfitting and improve training efficiency.

43. What is a Generative Adversarial Network (GAN)?

A Generative Adversarial Network (GAN) is a type of deep learning model used for generating realistic data, such as images, from random noise. It consists of two neural networks, a generator and a discriminator, competing against each other in a game-theoretic setup. The generator tries to create realistic data while the discriminator attempts to distinguish between real and generated data. Through this adversarial process, the generator gradually learns to produce highly realistic outputs.

Generator: Creates synthetic data from random noise, aiming to fool the discriminator.
Discriminator: Evaluates data and predicts whether it is real or generated.
Training is a minimax game where the generator minimizes its loss while the discriminator maximizes its ability to detect fake data.
GANs are widely used in image synthesis, style transfer, data augmentation, super-resolution and deepfake generation.

44. How do the generator and discriminator work in a GAN?

A Generative Adversarial Network (GAN) consists of two neural networks—the generator and the discriminator—that compete in an adversarial framework to produce realistic data.

Generator: The generator takes random noise as input and produces synthetic data (e.g., images). Its goal is to fool the discriminator into believing that the generated data is real. Over training, the generator learns to capture the underlying data distribution and create increasingly realistic outputs.
Discriminator: The discriminator receives both real data from the dataset and fake data from the generator. It outputs a probability indicating whether the input is real or generated. Its goal is to correctly distinguish real data from fake, forcing the generator to improve.

Training Process:

The generator and discriminator are trained alternately.
The generator minimizes the discriminator’s ability to detect fakes.
The discriminator maximizes its accuracy in distinguishing real vs. generated data.
This creates a minimax game where the generator improves by producing more realistic data and the discriminator improves by becoming a better detector.
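
The minimax game is usually written as the following value function, where D(x) is the discriminator's estimated probability that x is real, G(z) is the generator's output for noise z, p_data is the real data distribution and p_z is the noise prior:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator raises V by pushing D(x) toward 1 on real samples and D(G(z)) toward 0 on generated ones; the generator lowers V by making D(G(z)) approach 1.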

45. What is a DCGAN and how is it different from a vanilla GAN?

A DCGAN is a type of Generative Adversarial Network (GAN) that uses deep convolutional neural networks in both the generator and discriminator instead of fully connected networks, making it particularly suitable for generating images. By using convolutional layers, DCGANs can capture spatial hierarchies and local structures, producing higher-quality and more realistic images than vanilla GANs.

Feature	Vanilla GAN	DCGAN
Architecture	Fully connected (dense) layers	Convolutional layers in generator and discriminator
Image Quality	Often produces low-quality images	Produces high-quality, realistic images
Stability	Training can be unstable	Improved stability due to convolutional architectures and batch normalization
Upsampling / Downsampling	Uses dense layers in both networks	Uses transposed convolutions (upsampling) in generator and strided convolutions (downsampling) in discriminator
Applications	Simple synthetic data generation	Image synthesis, style transfer, super-resolution, etc.

46. Explain CycleGAN and its use case.

CycleGAN is a type of Generative Adversarial Network (GAN) designed for unpaired image-to-image translation. Unlike paired translation models such as pix2pix, which require aligned input-output image pairs for training, CycleGAN can learn mappings between two domains without direct correspondence. It uses a cycle-consistency loss to ensure that translating an image to the target domain and back reconstructs the original image.

How it Works:

Consists of two generators (A→B and B→A) and two discriminators (one for each domain).
Each generator translates images between domains while each discriminator evaluates if the translation is realistic.
Cycle-consistency loss ensures that an image translated to the other domain and back is similar to the original.
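
The cycle-consistency term can be sketched with toy stand-ins for the two generators. The lambda functions below are placeholders for real networks and the L1 distance is the usual choice for this loss; adversarial losses are omitted.

```python
import numpy as np

def cycle_consistency_loss(real_a, real_b, g_ab, g_ba):
    """L1 distance between each image and its round-trip translation."""
    recon_a = g_ba(g_ab(real_a))  # A -> B -> A
    recon_b = g_ab(g_ba(real_b))  # B -> A -> B
    return np.mean(np.abs(recon_a - real_a)) + np.mean(np.abs(recon_b - real_b))

# Toy "generators": g_ba_exact inverts g_ab, g_ba_poor does not
g_ab = lambda x: 2.0 * x + 1.0
g_ba_exact = lambda y: (y - 1.0) / 2.0
g_ba_poor = lambda y: 0.5 * y

rng = np.random.default_rng(0)
real_a = rng.random((2, 8, 8))
real_b = rng.random((2, 8, 8))

loss_exact = cycle_consistency_loss(real_a, real_b, g_ab, g_ba_exact)  # close to 0
loss_poor = cycle_consistency_loss(real_a, real_b, g_ab, g_ba_poor)    # 0.5 + 1.0
```

The loss is near zero only when the two mappings undo each other, which is what constrains the translation in the absence of paired examples.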

Use Cases:

Style transfer (e.g., turning photographs into paintings).
Season translation (e.g., summer to winter landscapes).
Object transfiguration and domain translation (e.g., horses to zebras, day to night images).
Medical imaging (e.g., translating MRI scans to CT scans).

47. What are Wasserstein GANs (WGANs) and how do they improve stability?

Wasserstein GANs (WGANs) are a variation of Generative Adversarial Networks designed to improve training stability and convergence. Traditional GANs often suffer from problems like mode collapse, vanishing gradients and unstable training, making it difficult for the generator and discriminator to converge. WGANs address these issues by using the Wasserstein distance (Earth Mover’s distance) as a measure of similarity between the real and generated data distributions, instead of the standard Jensen-Shannon divergence used in vanilla GANs.

Wasserstein Distance: Provides a smooth and continuous loss even when the distributions of real and fake data do not overlap, giving meaningful gradients to the generator.
Critic instead of Discriminator: Replaces the discriminator with a critic network that outputs a real-valued score instead of a probability.
Weight Clipping / Gradient Penalty: Enforces the Lipschitz constraint to ensure the Wasserstein distance is valid, improving training stability.
Reduces Mode Collapse: Encourages the generator to cover the full data distribution, avoiding collapse to a few outputs.
Stable Training: Loss correlates with the quality of generated samples, making it easier to monitor progress.
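
The losses and the weight-clipping step can be sketched as below. The critic scores are placeholder numbers rather than network outputs and the clip value 0.01 is a commonly used default, stated here as an assumption.

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    # The critic maximizes E[f(real)] - E[f(fake)], so we minimize the negation
    return np.mean(scores_fake) - np.mean(scores_real)

def generator_loss(scores_fake):
    # The generator tries to raise the critic's score on generated samples
    return -np.mean(scores_fake)

def clip_weights(weights, c=0.01):
    """Weight clipping: a crude way to keep the critic approximately Lipschitz."""
    return [np.clip(w, -c, c) for w in weights]

scores_real = np.array([1.0, 3.0])  # unbounded real-valued scores, not probabilities
scores_fake = np.array([0.0, 1.0])

c_loss = critic_loss(scores_real, scores_fake)   # 0.5 - 2.0 = -1.5
g_loss = generator_loss(scores_fake)             # -0.5
clipped = clip_weights([np.array([-0.5, 0.005, 0.2])])
```

The gradient-penalty variant replaces clip_weights with a penalty on the norm of the critic's gradient, which needs automatic differentiation and is not shown here.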

48. What are Conditional GANs (cGANs) and how do they work?

Conditional GANs (cGANs) are an extension of Generative Adversarial Networks (GANs) that allow the generation of data conditioned on additional information, such as class labels, text or other modalities. Unlike standard GANs which generate data from random noise alone, cGANs take both a noise vector and a conditional input to produce outputs that satisfy the specified condition. This enables controlled generation of images or data according to desired attributes.

How cGANs Work:

The generator receives both a random noise vector and a condition vector (e.g., class label) and generates data that matches the condition.
The discriminator receives the generated or real data along with the same condition and predicts whether the data is real or fake while also respecting the condition.
Training uses the standard GAN adversarial loss, but conditioned on the additional input.
Enables tasks like class-conditioned image generation, text-to-image synthesis and attribute-guided generation.
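
One simple way to inject the condition is to concatenate a one-hot label with the inputs of both networks, sketched below. The sizes (100-dimensional noise, 10 classes, 28×28 images) and function names are assumptions; other conditioning schemes such as learned embeddings are also used.

```python
import numpy as np

def one_hot(labels, num_classes):
    return np.eye(num_classes)[labels]

def generator_input(noise, labels, num_classes):
    """Noise vector and condition are concatenated before entering the generator."""
    return np.concatenate([noise, one_hot(labels, num_classes)], axis=1)

def discriminator_input(images, labels, num_classes):
    """The discriminator sees the (flattened) image together with the same condition."""
    flat = images.reshape(len(images), -1)
    return np.concatenate([flat, one_hot(labels, num_classes)], axis=1)

rng = np.random.default_rng(0)
labels = np.array([0, 3, 9, 3])
noise = rng.normal(size=(4, 100))
images = rng.random((4, 28, 28))  # placeholder for real or generated images

g_in = generator_input(noise, labels, num_classes=10)      # shape (4, 110)
d_in = discriminator_input(images, labels, num_classes=10) # shape (4, 794)
```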

49. What is a Variational Autoencoder (VAE) and how does it differ from GANs?

A Variational Autoencoder (VAE) is a generative model that learns to represent data in a continuous latent space and generate new data by sampling from this space. It consists of an encoder which maps input data to a probabilistic latent representation and a decoder which reconstructs data from the latent variables. Unlike traditional autoencoders, VAEs impose a probabilistic constraint on the latent space, encouraging smooth and continuous representations suitable for generating new samples.

Encoder outputs mean and variance for each latent variable, defining a probability distribution.
Decoder reconstructs data from a latent vector sampled from this distribution (the reparameterization trick keeps the sampling step differentiable).
Training minimizes a combination of reconstruction loss and KL divergence; the KL term pulls the encoded distributions toward a standard normal prior.
Generates diverse and smooth outputs by sampling from latent space.
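
The sampling step and the KL term can be sketched as follows, assuming (as is common) that the encoder outputs the mean and the log-variance of a diagonal Gaussian. The function names are illustrative.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ) per sample, summed over latent dimensions."""
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var), axis=1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.zeros((2, 2))          # unit variance

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)  # [0.0, 0.5]
```

The first sample already matches the prior, so its KL is 0; shifting one mean to 1 costs 0.5. The full VAE loss adds this term to the reconstruction error.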

Feature	VAE	GAN
Learning Approach	Probabilistic modeling of latent space	Adversarial training (generator vs discriminator)
Output Quality	Often blurry but smooth and diverse	High-quality and realistic images but may suffer from mode collapse
Training Stability	More stable and easier to train	Can be unstable, sensitive to hyperparameters
Latent Space	Explicit, continuous and interpretable	Implicit, learned through adversarial loss
Loss Function	Reconstruction + KL divergence	Adversarial loss (generator tries to fool discriminator)

50. Explain Denoising Autoencoders (DAEs).

A Denoising Autoencoder (DAE) is a type of autoencoder designed to remove noise from input data. Unlike standard autoencoders which learn to reconstruct the input exactly, DAEs are trained to reconstruct the original clean data from a corrupted version. This encourages the network to learn robust and meaningful features rather than merely copying the input.

Input data is intentionally corrupted (e.g., with Gaussian noise, masking or salt-and-pepper noise).
The encoder maps the noisy input to a latent representation.
The decoder reconstructs the clean original data from this representation.
Loss function typically measures reconstruction error between the original clean input and the network output.
Helps in feature learning, image denoising and pretraining for other deep learning tasks.
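
The corruption step and the loss target can be sketched as below. The noise level, masking probability and function names are assumptions for illustration; the encoder and decoder themselves are omitted.

```python
import numpy as np

def add_gaussian_noise(x, sigma, rng):
    return np.clip(x + rng.normal(0.0, sigma, size=x.shape), 0.0, 1.0)

def mask_pixels(x, drop_prob, rng):
    """Masking noise: set a random fraction of the inputs to zero."""
    return x * (rng.random(x.shape) >= drop_prob)

def reconstruction_loss(clean, output):
    """Mean squared error against the CLEAN input, not the corrupted one."""
    return np.mean((clean - output) ** 2)

rng = np.random.default_rng(0)
clean = rng.random((16, 64))                 # placeholder batch of flattened images
noisy = add_gaussian_noise(clean, sigma=0.2, rng=rng)
masked = mask_pixels(clean, drop_prob=0.3, rng=rng)

# A model that merely copied its (noisy) input would score this loss
copy_baseline = reconstruction_loss(clean, noisy)
```

Training feeds noisy or masked into the network and compares the output with clean, so copying the input is no longer an optimal solution.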

51. What is a Convolutional Autoencoder (CAE) and what are its applications?

A Convolutional Autoencoder (CAE) is a type of autoencoder that uses convolutional layers instead of fully connected layers to encode and decode image data. By using convolutions, CAEs can efficiently capture spatial hierarchies and local patterns in images, making them particularly suitable for image-related tasks. The encoder compresses the input image into a latent feature map and the decoder reconstructs the image from this representation.

Applications:

Image Denoising: Removing noise from corrupted images.
Dimensionality Reduction: Compressing images while preserving important features.
Anomaly Detection: Detecting unusual patterns by measuring reconstruction error.
Image Compression: Learning compact representations for storage or transmission.
Pretraining for CNNs: Learning feature representations for downstream tasks like classification or segmentation.

52. What is a Vision Transformer (ViT) and how does it work?

A Vision Transformer (ViT) is a deep learning model for image analysis that applies the transformer architecture, originally designed for natural language processing, to computer vision tasks. Instead of using convolutions, ViTs process images by splitting them into patches and treating each patch as a sequence token, similar to words in a sentence. This allows the model to capture long-range dependencies and global context in images effectively.

How ViT Works:

Patch Embedding: The input image is divided into fixed-size patches, each flattened and linearly projected into a vector.
Position Encoding: Adds positional information to each patch embedding to retain spatial relationships.
Transformer Encoder: Uses multi-head self-attention and feed-forward layers to model relationships between patches and extract features.
Classification Head: A special [CLS] token summarizes the image representation and is passed through a classifier to predict labels.
ViTs require large datasets or pretraining to perform competitively, as they lack the strong inductive bias of CNNs.
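
The patch embedding, [CLS] token and position encoding steps can be sketched in NumPy. The weights are random placeholders, the embedding size of 64 is arbitrary and the transformer encoder itself is omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch_size, embed_dim = 16, 64

patches = patchify(image, patch_size)                     # (196, 768)
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
cls_token = np.zeros((1, embed_dim))                      # learned in a real model
pos_embedding = rng.normal(scale=0.02, size=(patches.shape[0] + 1, embed_dim))

# Sequence fed to the transformer encoder: [CLS] + patch embeddings + positions
tokens = np.concatenate([cls_token, patches @ projection], axis=0) + pos_embedding
```

A 224×224 image with 16×16 patches yields 14 × 14 = 196 patch tokens, plus one [CLS] token.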

53. What is a Swin Transformer and how is it different from a standard ViT?

A Swin Transformer is a hierarchical vision transformer designed to improve the efficiency and scalability of standard Vision Transformers (ViTs) for computer vision tasks. It introduces a shifted window-based self-attention mechanism which computes attention locally within windows and then shifts the windows between layers to capture cross-window connections, allowing the model to capture both local and global image features efficiently.

Feature	Vision Transformer (ViT)	Swin Transformer
Attention Mechanism	Global self-attention over all image patches	Shifted window-based local attention
Computational Cost	High, scales quadratically with the number of patches (and thus with image size)	Lower, scales linearly with the number of patches for a fixed window size
Feature Hierarchy	Single-scale, fixed-size patch embeddings	Hierarchical, gradually reduces resolution like CNNs
Inductive Bias	Minimal, relies on large datasets	Includes locality and hierarchical structure, better for smaller datasets
Applications	Image classification, object detection	Image classification, detection, segmentation and dense prediction tasks
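
The cost difference can be checked with simple counting. The sketch below counts query-key pairs only (ignoring heads and channel dimensions); the 56×56 token grid and 7×7 window are assumed values for illustration.

```python
def attention_pairs_global(h, w):
    """Every token attends to every token."""
    n = h * w
    return n * n

def attention_pairs_windowed(h, w, window):
    """Tokens attend only within their own window x window block."""
    num_windows = (h // window) * (w // window)
    return num_windows * (window * window) ** 2

global_56 = attention_pairs_global(56, 56)           # 9_834_496
window_56 = attention_pairs_windowed(56, 56, 7)      # 153_664

# Doubling the side length quadruples the token count
global_112 = attention_pairs_global(112, 112)        # 16x more pairs
window_112 = attention_pairs_windowed(112, 112, 7)   # 4x more pairs
```

Quadrupling the number of tokens multiplies the global cost by 16 but the windowed cost only by 4, which is the quadratic versus linear behaviour listed in the table.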

54. Explain Convolutional Vision Transformer (CvT).

A Convolutional Vision Transformer (CvT) is a hybrid architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). It integrates convolutional layers into the token embedding and attention modules of a transformer, enabling the model to capture local spatial features efficiently while also modeling long-range dependencies through self-attention. This design improves performance, especially on smaller datasets, by providing an inductive bias similar to CNNs.

Applications:

Image classification and recognition.
Object detection and segmentation.
Scenarios where local features and global context are both important.

55. What is CLIP and how does it align text and image representations?

CLIP (Contrastive Language-Image Pre-training) is a multimodal model developed by OpenAI that learns to associate images with natural language descriptions. It jointly trains an image encoder and a text encoder to map both images and corresponding text into a shared embedding space, enabling the model to understand the relationship between visual and textual information.

How CLIP Works:

Image Encoder: Processes images (often with a CNN or Vision Transformer) to generate image embeddings.
Text Encoder: Processes text (e.g., sentences or captions) using a transformer to generate text embeddings.
Contrastive Learning: During training, CLIP maximizes the similarity between matching image-text pairs and minimizes similarity between non-matching pairs.
Shared Embedding Space: Both images and text are represented in the same high-dimensional space, allowing comparison using cosine similarity.

Applications:

Zero-shot image classification: Classify images without task-specific training.
Text-to-image retrieval: Find images matching a textual query.
Image captioning and search: Match images to descriptive language.
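
The contrastive objective can be sketched as a symmetric cross-entropy over a cosine-similarity matrix. The embeddings below are random placeholders for encoder outputs, and the temperature of 0.07 is an assumed value (it is a learned parameter in practice).

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Matching pairs sit on the diagonal of the similarity matrix."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))
text_matched = image_emb + 0.01 * rng.normal(size=(8, 32))  # well-aligned pairs
text_shuffled = np.roll(text_matched, 1, axis=0)            # every pair mismatched

loss_matched = clip_style_loss(image_emb, text_matched)
loss_shuffled = clip_style_loss(image_emb, text_shuffled)
```

The loss is small when each image is most similar to its own caption and large when the pairs are shuffled. Zero-shot classification reuses the same similarity matrix with one text prompt per class.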

56. What is ALIGN?

ALIGN (A Large-scale ImaGe and Noisy-text embedding) is a multimodal model developed by Google that, like CLIP, learns to align images and text in a shared embedding space. However, ALIGN is trained on a much larger dataset of more than one billion noisy image and alt-text pairs collected from the web with only minimal filtering, relying on scale to compensate for the noise. It uses contrastive learning to maximize the similarity of matched image-text pairs while minimizing similarity of mismatched pairs.

57. What is BLIP and what are the differences between CLIP, ALIGN and BLIP?

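BLIP (Bootstrapping Language-Image Pre-training) is a multimodal model from Salesforce Research that builds on CLIP-style alignment by incorporating both contrastive and generative objectives. While CLIP only aligns images and text, BLIP adds image-text matching and image-conditioned text generation (captioning) during pretraining, allowing the model to learn richer representations that support both retrieval and text generation. It generates text from images; it does not generate images.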

Feature	CLIP	ALIGN	BLIP
Developer	OpenAI	Google	Salesforce Research
Training Data	About 400 million image-text pairs collected from the web	More than one billion noisy web-scraped image and alt-text pairs	Web and human-annotated image-text pairs, with noisy captions filtered and re-generated (bootstrapping)
Training Objective	Contrastive learning	Contrastive learning	Contrastive + generative objectives
Architecture	Image encoder (ResNet or ViT) + text transformer	EfficientNet image encoder + BERT text encoder	ViT image encoder + text transformer usable as encoder or decoder
Zero-shot Performance	Good	Better due to massive-scale data	Improved via richer multimodal representation
Generative Capability	No	No	Yes, image-to-text generation (captioning, VQA answers)
Use Cases	Zero-shot classification, image-text retrieval	Large-scale zero-shot classification and retrieval	Image captioning, VQA, retrieval and generation
Key Advantage	Aligns text and images effectively	Scales to massive noisy data, robust embeddings	Combines retrieval and generative tasks for richer understanding

58. Difference between spatial filtering and frequency filtering.

Feature	Spatial Filtering	Frequency Filtering
Definition	Processes the image directly in the spatial domain, using kernels/masks on pixel values.	Processes the image in the frequency domain, using Fourier transforms and modifying frequency components.
Operation	Convolution or correlation with a kernel/mask.	Multiplying the Fourier-transformed image with a filter in the frequency domain.
Advantages	Simple, intuitive and works well for local operations like smoothing or edge detection.	Can easily perform global operations, like removing specific frequency noise or enhancing certain patterns.
Examples	Gaussian blur, Sobel filter, median filter	Low-pass filter, high-pass filter, notch filter
Computation	Direct pixel-wise computation	Requires FFT/IFFT transformations
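
The two approaches are linked by the convolution theorem: convolving in the spatial domain equals multiplying in the frequency domain. The sketch below checks this numerically for circular (wrap-around) convolution with a 3×3 averaging kernel; function names and sizes are illustrative.

```python
import numpy as np

def circular_convolve2d(image, kernel):
    """Spatial filtering: direct convolution with wrap-around borders."""
    out = np.zeros_like(image, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * np.roll(image, shift=(dy, dx), axis=(0, 1))
    return out

def fft_filter(image, kernel):
    """Frequency filtering: multiply the spectra, then transform back."""
    padded = np.zeros_like(image, dtype=float)
    padded[:kernel.shape[0], :kernel.shape[1]] = kernel  # zero-pad kernel to image size
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = np.ones((3, 3)) / 9.0  # averaging (low-pass) kernel

spatial_result = circular_convolve2d(image, kernel)
frequency_result = fft_filter(image, kernel)
```

Both results agree up to floating-point error. For small kernels the spatial route is usually cheaper; for large kernels the FFT route tends to win.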

59. Difference between Linear and Non-Linear Filters.

Feature	Linear Filters	Non-Linear Filters
Definition	Filters where the output is a linear combination of input pixel values.	Filters where the output is a non-linear function of input pixels.
Operation	Uses convolution or correlation with a kernel.	Uses operations like median, maximum or morphological functions.
Superposition Principle	Obeys linearity and superposition.	Does not obey superposition.
Noise Handling	Effective for Gaussian noise, less effective for impulse noise.	Effective for impulse noise (salt-and-pepper) and preserving edges.
Examples	Averaging filter, Gaussian filter, Sobel filter	Median filter, morphological filters, adaptive filters
Effect on Edges	Can blur edges while smoothing noise.	Preserves edges better while reducing noise.
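
A small 1D sketch can make the superposition and noise-handling rows concrete. It assumes a window of size 3 with edge-replicated borders; `mean_filter` and `median_filter` are illustrative helpers written for this example.

```python
import numpy as np

def mean_filter(signal, size=3):
    """Linear filter: each output is a weighted sum of neighbouring inputs."""
    padded = np.pad(signal, size // 2, mode="edge")
    return np.convolve(padded, np.ones(size) / size, mode="valid")

def median_filter(signal, size=3):
    """Non-linear filter: each output is the median of the local window."""
    padded = np.pad(signal, size // 2, mode="edge")
    return np.array([np.median(padded[i:i + size]) for i in range(len(signal))])

# A step edge (10 -> 50) corrupted by one impulse (255).
signal = np.array([10, 10, 10, 255, 10, 10, 50, 50, 50, 50], dtype=float)
print(mean_filter(signal))    # the impulse is smeared over its neighbours
print(median_filter(signal))  # the impulse is removed and the step stays sharp

# Superposition holds for the mean filter but not for the median filter.
a = np.array([0, 0, 9, 0, 0], dtype=float)
b = np.array([0, 9, 0, 0, 0], dtype=float)
print(np.allclose(mean_filter(a + b), mean_filter(a) + mean_filter(b)))
print(np.allclose(median_filter(a + b), median_filter(a) + median_filter(b)))
```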

60. Difference between image sharpening and image smoothing.

Feature	Image Sharpening	Image Smoothing
Definition	Enhances edges and fine details in an image.	Reduces noise and smooths variations in pixel values.
Purpose	To make edges and textures more prominent.	To remove noise and produce a visually smoother image.
Operation	Emphasizes high-frequency components using filters like Laplacian or high-pass filters.	Suppresses high-frequency components using filters like averaging, Gaussian or median filters.
Effect on Noise	Can amplify noise along with edges.	Reduces or removes noise effectively.
Common Filters	Laplacian, Sobel, Unsharp masking	Gaussian filter, median filter, averaging filter
Applications	Edge enhancement, feature extraction, medical imaging	Noise reduction, preprocessing for analysis, artistic smoothing
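
The opposite effects on high-frequency content can be seen on a synthetic step edge. This is a rough sketch, assuming a 3×3 box blur for smoothing and unsharp masking (original plus a scaled difference from the blurred copy) for sharpening; `box_blur`, `unsharp_mask` and `max_step` are hypothetical helper names.

```python
import numpy as np

def box_blur(image, size=3):
    """Smoothing: average each pixel with its size x size neighbourhood."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (size * size)

def unsharp_mask(image, amount=1.0, size=3):
    """Sharpening: add back the detail removed by the blur."""
    return image + amount * (image - box_blur(image, size))

def max_step(image):
    """Largest intensity jump between horizontally adjacent pixels."""
    return np.abs(np.diff(image, axis=1)).max()

# Vertical step edge: dark left half, bright right half.
image = np.full((8, 8), 50.0)
image[:, 4:] = 150.0

smoothed = box_blur(image)
sharpened = unsharp_mask(image)
print(max_step(smoothed), max_step(image), max_step(sharpened))
```

The sharpened result overshoots on both sides of the edge, so in practice the output is usually clipped back to the valid intensity range.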

61. Difference between erosion and dilation.

Feature	Erosion	Dilation
Definition	Shrinks or erodes object boundaries in a binary image.	Expands or grows object boundaries in a binary image.
Effect on Objects	Reduces size of foreground objects.	Increases size of foreground objects.
Effect on Holes	Enlarges background areas (makes holes bigger).	Shrinks background areas (fills small holes).
Structuring Element	Uses a kernel to remove pixels from object edges.	Uses a kernel to add pixels to object edges.
Applications	Removing small noise, separating objects, thinning.	Filling gaps, connecting components, smoothing object edges.
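
As a minimal illustration, binary erosion and dilation with a square structuring element reduce to an "all neighbours set" and an "any neighbour set" test. The sketch below assumes a boolean image, a 3×3 square element and background padding outside the image; `erode` and `dilate` are written here for illustration and are not a library API.

```python
import numpy as np

def _shifted_views(binary, size):
    """Stack every shifted copy of the image covered by a size x size square."""
    pad = size // 2
    # Assumption: pixels outside the image count as background (False).
    padded = np.pad(binary, pad, mode="constant", constant_values=False)
    h, w = binary.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(size) for dx in range(size)])

def erode(binary, size=3):
    """A pixel stays foreground only if its whole neighbourhood is foreground."""
    return _shifted_views(binary, size).all(axis=0)

def dilate(binary, size=3):
    """A pixel becomes foreground if any neighbour is foreground."""
    return _shifted_views(binary, size).any(axis=0)

image = np.zeros((7, 7), dtype=bool)
image[2:5, 2:5] = True  # a 3x3 foreground square

eroded = erode(image)
dilated = dilate(image)
print(image.sum(), eroded.sum(), dilated.sum())  # 9 1 25
```

Chaining the two gives the usual compound operations: `dilate(erode(x))` is an opening (removes small specks) and `erode(dilate(x))` is a closing (fills small holes).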

62. Difference between Sobel, Prewitt and Canny edge detectors.

Feature	Sobel	Prewitt	Canny
Definition	Computes edges by combining derivatives in the x and y directions using a center-weighted kernel.	Computes edges using simple derivatives in the x and y directions with a uniform kernel.	Multi-stage edge detector using Gaussian smoothing, gradient computation, non-maximum suppression and hysteresis thresholding.
Kernel Size	Typically 3×3, weighted toward center.	Typically 3×3, uniform weights.	Uses gradient calculation (can use Sobel internally) plus additional processing steps.
Noise Sensitivity	Sensitive to noise; smoothing helps.	Sensitive to noise; less robust than Sobel.	Less sensitive due to Gaussian smoothing before edge detection.
Edge Localization	Moderate accuracy in locating edges.	Moderate accuracy; slightly less precise than Sobel.	High accuracy due to non-maximum suppression.
Complexity	Simple, fast	Simple, fast	More complex, slower than Sobel/Prewitt.
Output	Gradient magnitude map	Gradient magnitude map	Thin, precise edges after thresholding.
Applications	Basic edge detection, feature extraction	Basic edge detection, directional edges	Object detection, image segmentation, feature extraction requiring precise edges
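
The only difference between Sobel and Prewitt is the kernel weighting, which a short sketch makes visible. It assumes 3×3 kernels, edge-replicated borders and a synthetic step edge; `correlate3x3` and `gradient_magnitude` are illustrative helpers. Canny is not implemented here, since it adds smoothing, non-maximum suppression and hysteresis on top of a gradient like this one.

```python
import numpy as np

# Horizontal-derivative kernels; the vertical ones are their transposes.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)    # centre row weighted by 2
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)  # uniform weights

def correlate3x3(image, kernel):
    """Slide a 3x3 kernel over the image (correlation, edge-replicated border)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def gradient_magnitude(image, kernel_x):
    gx = correlate3x3(image, kernel_x)
    gy = correlate3x3(image, kernel_x.T)
    return np.hypot(gx, gy)

# Vertical step edge between columns 2 and 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

sobel = gradient_magnitude(image, SOBEL_X)
prewitt = gradient_magnitude(image, PREWITT_X)
print(sobel.max(), prewitt.max())  # 4.0 3.0
```

Both operators respond on the two columns adjacent to the step, which is why their raw output is a thick gradient map rather than the one-pixel-wide edges Canny produces.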

63. Difference between Fast R-CNN, Faster R-CNN and Mask R-CNN.

Feature	Fast R-CNN	Faster R-CNN	Mask R-CNN
Region Proposal	Uses external methods (e.g., selective search) to generate region proposals.	Uses Region Proposal Network (RPN) to generate proposals internally.	Uses RPN like Faster R-CNN for region proposals.
Detection Process	Extracts features for each proposed region using RoIPool, then classifies and regresses bounding boxes.	Shares convolutional features between RPN and detection head, faster and more efficient.	Adds a mask prediction branch in parallel to classification and bounding box regression.
Segmentation Capability	No	No	Yes, provides pixel-level instance masks.
Speed	Slower due to external proposal generation.	Faster than Fast R-CNN due to integrated RPN.	Slightly slower than Faster R-CNN due to mask branch.
Output	Class labels + bounding boxes	Class labels + bounding boxes	Class labels + bounding boxes + instance masks
Applications	Object detection	Object detection	Object detection + instance segmentation
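
The RoIPool step mentioned for Fast R-CNN turns proposals of any size into a fixed-size feature grid. Below is a simplified sketch of that idea only, assuming a single-channel feature map, integer RoI coordinates already expressed in feature-map units, and an RoI at least as large as the output grid. The function `roi_pool` is hypothetical and omits details of real implementations; Mask R-CNN replaces this step with RoIAlign, which uses bilinear interpolation instead of integer bins.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi = (x0, y0, x1, y1) into a fixed output grid.

    Simplifying assumptions: x1 and y1 are exclusive, and the region is at
    least output_size pixels in each dimension so that no bin is empty.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    # Integer bin boundaries that split the region into out_h x out_w cells.
    ys = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

feature_map = np.arange(36).reshape(6, 6)
small = roi_pool(feature_map, (1, 1, 5, 5))  # 4x4 proposal
large = roi_pool(feature_map, (0, 0, 6, 6))  # 6x6 proposal
print(small.shape, large.shape)  # both proposals yield a 2x2 grid
```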

64. How would you design a face recognition system from scratch?

Designing a face recognition system involves multiple stages, including data collection, preprocessing, feature extraction and classification. The goal is to accurately identify or verify individuals based on their facial features. Here’s a structured approach:

1. Data Collection

Collect a large dataset of face images with sufficient variation in lighting, pose, expression and background.
Examples: LFW, VGGFace2, CASIA-WebFace or your custom dataset.

2. Face Detection

Detect faces in images to crop and normalize the region of interest.
Common methods: Haar Cascades, HOG + SVM or deep learning-based detectors like MTCNN or RetinaFace.

3. Face Alignment and Preprocessing

Align faces so that eyes, nose and mouth are in standard positions.
Convert images to grayscale or normalize color channels.
Resize images to a fixed size (e.g., 112×112 or 224×224).

4. Feature Extraction

Extract a compact representation (embedding) for each face.

Methods:

Traditional: PCA (Eigenfaces), LDA (Fisherfaces) or LBPH (Local Binary Patterns Histograms).
Deep Learning: CNN-based embeddings (e.g., FaceNet, ArcFace or a custom CNN).

The feature vector should be robust to pose, lighting and expression changes.

5. Feature Matching / Classification

Face Verification: Compare embeddings using a distance metric like cosine similarity or Euclidean distance.
Face Identification: Use a classifier (SVM, k-NN or softmax) trained on embeddings to predict the person’s identity.

6. Training Considerations

Data Augmentation: Apply rotations, flips, brightness adjustments or random crops to improve generalization.

Loss Functions for Deep Models:

Triplet loss (FaceNet)
ArcFace / CosFace (margin-based softmax losses: additive angular margin and additive cosine margin, respectively)

7. System Deployment

Real-time recognition: Optimize for speed using frameworks like TensorRT or OpenCV DNN.
Database management: Store embeddings in a searchable index (e.g., FAISS) for fast retrieval.
Thresholding: Define a distance threshold for verification or rejection.
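
The matching and thresholding steps can be sketched with plain NumPy once embeddings exist. The example below assumes embeddings have already been produced by some model; the 3-dimensional vectors, the names "alice" and "bob" and the threshold of 0.5 are made-up placeholders, and in practice the threshold would be tuned on a validation set.

```python
import numpy as np

def l2_normalize(v):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def verify(emb_a, emb_b, threshold=0.5):
    """Face verification: are two embeddings the same person?"""
    sim = float(np.dot(l2_normalize(emb_a), l2_normalize(emb_b)))
    return sim >= threshold, sim

def identify(query, gallery, names, threshold=0.5):
    """Face identification: nearest enrolled identity, or None if below threshold."""
    sims = l2_normalize(gallery) @ l2_normalize(query)
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None, float(sims[best])  # reject as unknown
    return names[best], float(sims[best])

# Toy gallery of enrolled embeddings (hypothetical values).
names = ["alice", "bob"]
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])

known = np.array([0.9, 0.1, 0.0])
unknown = np.array([0.0, 0.0, 1.0])

print(identify(known, gallery, names))
print(identify(unknown, gallery, names))
print(verify(known, gallery[0]))
```

For large galleries the brute-force matrix product would typically be replaced by an approximate nearest-neighbour index such as FAISS, as noted above.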

65. If your CNN model is overfitting, what methods would you use to fix it?

Overfitting occurs when a CNN performs well on the training data but poorly on unseen data, meaning it has memorized training features rather than learning general patterns. Several strategies can reduce overfitting:

1. Data Augmentation

Increase dataset diversity by applying random rotations, flips, translations, scaling, brightness adjustments or other transformations.
Helps the model generalize better by learning invariant features.

2. Regularization Techniques

Dropout: Randomly deactivate neurons during training to prevent co-adaptation.
Weight decay (L2 regularization): Penalizes large weights to encourage simpler models.
Early stopping: Stop training when validation loss stops improving to avoid overfitting.

3. Reduce Model Complexity

Use a smaller network with fewer layers or filters.
Avoid unnecessarily large CNNs for small datasets.

4. Transfer Learning

Use a pretrained CNN and fine-tune only the last few layers.
Reduces the risk of overfitting on small datasets.

5. Batch Normalization

Stabilizes learning and allows higher learning rates; the noise from mini-batch statistics also acts as a mild regularizer.

6. Cross-Validation

Use k-fold cross-validation to estimate model performance and detect overfitting.

7. Increase Dataset Size

Collect more data or use synthetic data generation to provide more examples for training.
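
Of the strategies above, early stopping is the easiest to show in a few lines. The sketch below assumes one validation loss per epoch and a patience-based rule; `early_stopping_epoch`, `patience` and `min_delta` are illustrative names, and the loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Validation loss falls, then rises as the model starts to overfit.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.61, 0.63, 0.66, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In a real training loop the weights from `best_epoch` would be checkpointed and restored after stopping.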

66. How would you perform tumor segmentation in MRI scans using K-Means clustering? Explain the steps involved and any preprocessing required.

K-Means clustering is an unsupervised method that can segment tumors based on pixel intensity differences in MRI scans.

1. Preprocessing:

Noise reduction: Apply Gaussian or median filtering.
Intensity normalization: Standardize pixel intensities across scans.
ROI extraction: Focus on brain regions to avoid irrelevant areas.

2. Flatten Image:

Convert the 2D MRI image into a 1D array of pixel intensities for clustering.

3. Apply K-Means:

Choose a suitable number of clusters (K), e.g., K = 2 (tumor vs. everything else) or K = 3 (background, normal tissue and tumor).
Run K-Means to assign each pixel to a cluster based on intensity similarity.

4. Reshape Clusters:

Convert the 1D cluster labels back into the original image shape.

5. Post-processing:

Use morphological operations (opening, closing) to remove small noisy regions.
Optionally, select the cluster corresponding to the tumor based on intensity or size criteria.

The output is a segmented image that highlights the tumor region for further analysis or classification.
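
The flatten, cluster and reshape steps can be sketched as follows. This is a toy example under loud assumptions: the "scan" is a synthetic array rather than real MRI data, the tumor is assumed to be the brightest cluster (which depends on the MRI sequence and does not hold in general), and K-Means is a small hand-written 1D version (`kmeans_1d`) with evenly spaced initial centroids rather than a library implementation. Denoising and morphological post-processing are left out.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=20):
    """Plain K-Means on scalar intensities with evenly spaced initial centroids."""
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            members = values[labels == c]
            if members.size:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean()
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids

def segment_brightest_cluster(image, k=3):
    flat = image.astype(float).ravel()                              # flatten to 1D
    flat = (flat - flat.min()) / (flat.max() - flat.min() + 1e-8)   # normalize to [0, 1]
    labels, centroids = kmeans_1d(flat, k)                          # cluster intensities
    label_map = labels.reshape(image.shape)                         # back to image shape
    # Assumption: the cluster with the highest mean intensity is the tumor.
    return label_map == np.argmax(centroids)

# Synthetic stand-in for a scan: dark background, mid-grey tissue, bright patch.
rng = np.random.default_rng(0)
scan = np.zeros((20, 20))
scan[4:16, 4:16] = 100.0
scan[8:11, 8:11] = 220.0
scan += rng.normal(0.0, 5.0, scan.shape)

mask = segment_brightest_cluster(scan, k=3)
print(mask.sum())  # number of pixels assigned to the brightest cluster
```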