Loading

Using Painless regular expressions

Serverless Stack

This tutorial uses the kibana_sample_data_ecommerce dataset. Refer to Context example data to get started.

The goal is to extract and categorize email domains during document ingestion for marketing segmentation. We will use regular expression pattern matching to parse a set of email addresses and break them into their component parts. The expression /([^@]+)@(.+)/ splits emails at the @ symbol, while domain-specific patterns like (gmail|yahoo|hotmail|outlook|icloud|aol)\.com/ identify personal email providers for automated categorization.

The ingest pipeline allows you to extract structured data from email addresses during indexing. Regular expressions parse email components and categorize domains, creating fields for targeted marketing campaigns and customer analysis.

Create an ingest pipeline that extracts email domains and categorizes them:

PUT _ingest/pipeline/kibana_sample_data_ecommerce-extract_email_domains
{
  "description": "Extract and categorize email domains from customer emails",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // Extract email domain using regex
          Pattern emailPattern = /([^@]+)@(.+)/;
          Matcher emailMatcher = emailPattern.matcher(ctx.email);

          if (emailMatcher.matches()) {
            String username = emailMatcher.group(1);
            String domain = emailMatcher.group(2);

            // Store extracted components
            ctx.email_username = username;
            ctx.email_domain = domain;

            // Categorize domain type using regex patterns
            Pattern personalDomains = /(gmail|yahoo|hotmail|outlook|icloud|aol)\.com/;
            Pattern businessDomains = /\.(co|corp|inc|ltd|org|edu|gov)$/;
            Pattern testDomains = /\.zzz$/;

            if (testDomains.matcher(domain).find()) {
              ctx.email_category = "test";
            } else if (personalDomains.matcher(domain).find()) {
              ctx.email_category = "personal";
            } else if (businessDomains.matcher(domain).find()) {
              ctx.email_category = "business";
            } else {
              ctx.email_category = "other";
            }

            // Extract top-level domain
            Pattern tldPattern = /\.([a-zA-Z]{2,})$/;
            Matcher tldMatcher = tldPattern.matcher(domain);
            if (tldMatcher.find()) {
              ctx.email_tld = tldMatcher.group(1);
            }
          }
        """
      }
    }
  ]
}
		

This script includes the following steps:

  • Pattern matching: Uses regex /([^@]+)@(.+)/ to split email into username and domain parts
  • Domain categorization: Classifies domains using specific patterns:
    • Personal: gmail.com, yahoo.com, hotmail.com
    • Business: domains ending in .co, .corp, .inc, .org, .edu, .gov
    • Test: domains ending in .zzz
    • Other: Any domain not matching the above patterns
  • TLD extraction: Extracts the top-level domain (com, org, edu, etc.) using /\.([a-zA-Z]{2,})$/ regular expression
  • Test domain detection: Identifies sample data domains ending in .zzz (used in the Kibana ecommerce sample dataset)
  • Fields addition: Adds new fields (email_username, email_domain, email_category, email_tld) to the document for analytics and segmentation

For more details about Painless regular expressions, refer to Painless regex documentation.

Simulate the pipeline with sample e-commerce customer data:

POST _ingest/pipeline/kibana_sample_data_ecommerce-extract_email_domains/_simulate
{
  "docs": [
    {
      "_source": {
        "customer_full_name": "Eddie Underwood",
        "email": "[email protected]"
      }
    },
    {
      "_source": {
        "customer_full_name": "John Smith",
        "email": "[email protected]"
      }
    },
    {
      "_source": {
        "customer_full_name": "Sarah Wilson",
        "email": "[email protected]"
      }
    }
  ]
}
		

The goal is to parse SKU codes using regular expressions to extract manufacturer identifiers, then group products by these patterns to reveal inventory distribution across clothing, accessories, and footwear categories without creating additional index fields.

Aggregations with runtime fields allow you to analyze SKU patterns across thousands of products. The regex patterns validate SKU formats and extract meaningful segments from codes like “ZO0549605496" to categorize products by manufacturer patterns.

Create aggregations that analyze SKU patterns:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "runtime_mappings": {
    "sku_category": {
      "type": "keyword",
      "script": {
        "lang": "painless",
        "source": """
          if (doc.containsKey('sku')) {
            String sku = doc['sku'].value;

            // Validate SKU format using regex
            Pattern skuPattern = /^ZO(\d{4})(\d{6})$/;
            Matcher skuMatcher = skuPattern.matcher(sku);

            if (skuMatcher.matches()) {
              String manufacturerCode = skuMatcher.group(1);

              // Determine product category based on manufacturer code patterns
              if (manufacturerCode =~ /^0[1-3].*/) {
                emit("clothing");
              } else if (manufacturerCode =~ /^0[4-6].*/) {
                emit("accessories");
              } else if (manufacturerCode =~ /^0[0-9].*/) {
                emit("footwear");
              } else {
                emit("other");
              }
            } else {
              emit("invalid_or_unknown");
            }
          } else {
            emit("invalid_or_unknown");
          }
        """
      }
    }
  },
  "aggs": {
    "sku_category_breakdown": {
      "terms": {
        "field": "sku_category"
      }
    }
  }
}
		
  1. First 4 digits after ZO

This script includes the following steps:

  • Format validation: Uses regex /^ZO(\d{4})(\d{6})$/ to ensure the SKU formats follow the expected pattern (ZO + 4 digits + 6 digits)
  • Component extraction: Separates the manufacturer code product ID to enable category analysis
  • Category classification: Maps manufacturer patterns to product types for inventory reporting
  • Aggregated analysis: Counts products by category to show inventory distribution without storing new fields

For more details about Painless regular expressions, refer to Painless regex documentation.

The results provide comprehensive SKU format analysis across your product catalog:

The goal is to convert custom timestamp formats like “25-AUG-2025@14h30m” into ISO 8601 standard “2025-08-25T14:30:00Z” during data ingestion. This ensures consistent date fields across datasets from different source systems.

Legacy manufacturing and ERP systems often use non-standard date formats that Elasticsearch cannot directly parse. Converting these during ingestion eliminates the need for repeated parsing at query time and ensures proper date range filtering and sorting.

Create an ingest pipeline that converts custom date formats:

PUT _ingest/pipeline/kibana_sample_data_ecommerce-convert_custom_dates
{
  "description": "Convert custom date formats to ISO 8601 standard",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // Format: "25-AUG-2025@14h30m" (DD-MMM-YYYY@HHhMMm)
          if (ctx.containsKey('manufacturing_date_custom')) {
            String customDate = ctx.manufacturing_date_custom;

            // Extract date components using regex
            Pattern customDatePattern = /(\d{1,2})-([A-Z]{3})-(\d{4})@(\d{2})h(\d{2})m/;
            Matcher customDateMatcher = customDatePattern.matcher(customDate);

            if (customDateMatcher.matches()) {
              String day = customDateMatcher.group(1);
              String monthAbbr = customDateMatcher.group(2);
              String year = customDateMatcher.group(3);
              String hour = customDateMatcher.group(4);
              String minute = customDateMatcher.group(5);

              Map monthMap = [
                'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
                'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
                'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'
              ];

              // Getting month based on month abbreviation
              String monthNum = monthMap.getOrDefault(monthAbbr, '01');

              // Format day with leading zero using regex pattern validation
              Pattern singleDigitPattern = /^\d$/;
              String dayFormatted = singleDigitPattern.matcher(day).matches() ? "0" + day : day;

              // Create ISO 8601 formatted date
              ctx.manufacturing_date = year + "-" + monthNum + "-" + dayFormatted + "T" + hour + ":" + minute + ":00Z";
              ctx.manufacturing_date_parsed = true;

              // Validate final ISO format using regex
              Pattern validDatePattern = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$/;
              ctx.iso_format_valid = validDatePattern.matcher(ctx.manufacturing_date).matches();

            } else {
              ctx.manufacturing_date_parsed = false;
              ctx.parse_error = "Invalid custom date format";
            }
          }
        """
      }
    }
  ]
}
		

This script includes the following steps:

  • Pattern extraction: Uses regex /(\d{1,2})-([A-Z]{3})-(\d{4})@(\d{2})h(\d{2})m/ to parse the custom format "25-AUG-2025@14h30m"
  • Month conversion: Uses a Map lookup instead of multiple regex conditions for cleaner code
  • Regex validation: Uses the /^\d$/ pattern to detect single-digit days that need zero-padding
  • Format standardization: Reconstructs date components into ISO 8601 format (YYYY-MM-DDTHH:mm:ssZ)
  • ISO validation: Additional regex /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$/ confirms the final format is valid

For more details about Painless scripting in the ingest context, refer to Ingest processor context.

Simulate the pipeline with sample custom date formats:

POST _ingest/pipeline/kibana_sample_data_ecommerce-convert_custom_dates/_simulate
{
  "docs": [
    {
      "_source": {
        "product_name": "Winter Jacket",
        "manufacturing_date_custom": "25-AUG-2025@14h30m"
      }
    },
    {
      "_source": {
        "product_name": "Summer Shoes",
        "manufacturing_date_custom": "03-JUN-2025@09h15m"
      }
    }
  ]
}
		

The simulation shows successful conversion to ISO 8601 format:

This tutorial demonstrates practical regex applications across ingest, aggregation, and runtime field contexts. For comprehensive regex capabilities and advanced pattern techniques, explore these resources: