MongoDB $substrCP Operator

Last Updated : 16 Apr, 2026

The $substrCP operator in MongoDB extracts substrings based on Unicode code points within the aggregation pipeline, ensuring correct handling of both ASCII and non-ASCII characters for multilingual text processing.

  • Extracts substrings using Unicode code point index and length.
  • Handles multibyte and non-ASCII characters correctly.
  • Designed for use in aggregation stages (e.g., $project, $addFields).
  • Avoids the mid-character cuts that byte-based operators such as $substrBytes can run into.
  • Suitable for multilingual and special-character text manipulation.

Syntax

{ $substrCP: [ <string expression>, <code point index>, <code point count> ] }
  • string expression: Any expression that resolves to a string. The string may contain letters, digits, and special characters.
  • code point index: A non-negative integer giving the zero-based position of the first code point to extract.
  • code point count: A non-negative integer giving the number of code points to extract, starting at the code point index.
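
To make the index and count rules concrete, here is a minimal JavaScript sketch that mimics them outside MongoDB. The helper name substrCP is hypothetical and only illustrates how code points are counted; it is not how the server implements the operator.

```javascript
// Hypothetical helper that mimics the counting rule of $substrCP in plain JavaScript.
// Array.from iterates a string by Unicode code point, not by byte or UTF-16 unit.
function substrCP(str, start, count) {
  if (!Number.isInteger(start) || start < 0 ||
      !Number.isInteger(count) || count < 0) {
    throw new RangeError("start and count must be non-negative integers");
  }
  return Array.from(str).slice(start, start + count).join("");
}

console.log(substrCP("MongoDB", 0, 5));           // "Mongo"
console.log(substrCP("caf\u00e9 au lait", 3, 4)); // "é au"
console.log(substrCP("a\u{1F600}b", 1, 1));       // the single emoji U+1F600
```

The index is zero-based, so an index of 3 starts at the fourth code point, and the emoji counts as one code point even though it occupies four bytes in UTF-8.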

Importance of $substrCP

The main reasons to use $substrCP are:

  • Works with non-ASCII characters (e.g., Chinese, Arabic, emojis).
  • Uses Unicode code points instead of byte positions, ensuring accuracy.
  • Supports multibyte character sets effectively.
  • Enhances data processing in multilingual applications.
  • Helps extract portions of text fields for analysis, filtering, and transformations.

Examples of MongoDB $substrCP Operator

The examples below run against the following collection.

  • Database: GeeksforGeeks
  • Collection: articles
  • Documents: Three documents that contain the details of the articles in the form of field-value pairs.
[Screenshot: the three documents in the articles collection]

Example 1: Using $substrCP operator

Split the publishedon field into two new fields, publicationmonth and publicationyear.

db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      // Code points 0 to 3 of publishedon
      publicationmonth: { $substrCP: ["$publishedon", 0, 4] },
      // Code points 4 to 7 of publishedon
      publicationyear: { $substrCP: ["$publishedon", 4, 4] }
    }
  }
])

Output:

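[Screenshot: query output]
  • "publicationmonth" takes the first 4 code points of publishedon (indexes 0 to 3). If publishedon stores the year first, these 4 code points are the year, and the two field names should be swapped.
  • "publicationyear" takes the next 4 code points, starting at index 4. The count is fixed at 4 here; to take everything after index 4 regardless of length, the count can be computed with $subtract and $strLenCP.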
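
Computing the count dynamically could look like the sketch below. The output field names firstPart and rest are hypothetical, and the pipeline assumes every publishedon value is a string with at least 4 code points.

```javascript
// Sketch: take everything after the first 4 code points, whatever the total length.
// Assumption: publishedon always has at least 4 code points. A shorter value would
// make the computed count negative, and the count must be non-negative.
const pipeline = [
  {
    $project: {
      articlename: 1,
      firstPart: { $substrCP: ["$publishedon", 0, 4] },
      rest: {
        $substrCP: [
          "$publishedon",
          4,
          // Remaining length = total code points minus the 4 already taken
          { $subtract: [{ $strLenCP: "$publishedon" }, 4] }
        ]
      }
    }
  }
];

// In mongosh: db.articles.aggregate(pipeline)
```

Defining the pipeline as a variable keeps the stage easy to inspect before passing it to db.articles.aggregate.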

Example 2: Single-Byte Character Set

Create a new field shortName with only the first 10 characters of each article's name. This is useful for displaying short previews of article titles.

db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      shortName: {
        $substrCP: ["$articlename", 0, 10]
      }
    }
  }
])

Output:

[Screenshot: query output]
  • $substrCP extracts a substring starting at index 0 (the first character) and taking 10 code points from articlename.
  • The resulting shortName contains the first 10 characters, which can be used as a preview or snippet of the full title.
  • Because the count is in code points rather than bytes, a title that contains multibyte characters is still cut on a character boundary.
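
For a preview, it can help to mark titles that were actually shortened. The sketch below is one possible variant, not part of the original example: it appends "..." only when the title has more than 10 code points. The previewLength constant is a hypothetical name, and the pipeline assumes articlename is a string in every document, since $strLenCP expects a string.

```javascript
// Sketch: append "..." only when the title is longer than the preview length.
// Assumption: articlename is present and is a string in every document.
const previewLength = 10;

const pipeline = [
  {
    $project: {
      articlename: 1,
      shortName: {
        $cond: {
          // Compare the title length in code points with the preview length
          if: { $gt: [{ $strLenCP: "$articlename" }, previewLength] },
          then: {
            $concat: [{ $substrCP: ["$articlename", 0, previewLength] }, "..."]
          },
          else: "$articlename"
        }
      }
    }
  }
];

// In mongosh: db.articles.aggregate(pipeline)
```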

Example 3: Handling Multibyte Character Set

Suppose another document in the articles collection has an articlename written in a multibyte character set. The following query takes its first 15 code points.

db.articles.aggregate([
  {
    $project: {
      shortName: { $substrCP: ["$articlename", 0, 15] }
    }
  }
])

Output:

[Screenshot: query output]

Because $substrCP counts code points rather than bytes, each multibyte character is extracted whole and is never cut in half.
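
The difference between bytes and code points can be checked with a short sketch. The sample title below is hypothetical and is not taken from the articles collection; the pipeline fields bytes, codePoints, and firstTwo are illustrative names.

```javascript
// Sketch: compare byte length and code point length for a multibyte title.
// The sample title is hypothetical.
const title = "寿司 recipes";

const byteLength = new TextEncoder().encode(title).length; // UTF-8 bytes
const codePointLength = Array.from(title).length;          // Unicode code points

console.log(byteLength);      // 14: each of the two kanji takes 3 bytes
console.log(codePointLength); // 10

// The same comparison inside an aggregation. A byte-based slice of 2 bytes
// would stop inside the first 3-byte character, while $substrCP with a
// count of 2 returns the two kanji whole.
const pipeline = [
  {
    $project: {
      bytes: { $strLenBytes: "$articlename" },
      codePoints: { $strLenCP: "$articlename" },
      firstTwo: { $substrCP: ["$articlename", 0, 2] }
    }
  }
];

// In mongosh: db.articles.aggregate(pipeline)
```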

Important Points About MongoDB $substrCP Operator

Here are some important points:

  • $substrCP extracts substrings in the aggregation pipeline using Unicode code point index and length.
  • Safely handles non-ASCII/multibyte characters.
  • Designed for efficient Unicode-aware string manipulation in aggregation.