decodeHtmlEntities must decode invisible-char named entities (&shy;, &zwj;, &lrm;, etc.) before hardenUnicodeText Step 3 strippi [Content truncated due to length] #31702

@szabta89

Summary

decodeHtmlEntities in actions/setup/js/sanitize_content_core.cjs (v0.68.3) handles decimal (e.g. &#173;) and hex (e.g. &#xAD;) numeric entities for invisible/formatting characters, but does not handle their named entity forms (&shy;, &zwj;, &zwnj;, &lrm;, &rlm;). Because hardenUnicodeText Step 3 operates on actual Unicode code points, not on &name; string literals, the named entity forms survive intact. neutralizeAllMentions then fails to match @&shy;victim because the character after @ is & (not [A-Za-z0-9]), so the mention passes through unsanitized. When GitHub renders the output, the entity decodes to an invisible character and the result appears as @victim to readers. This is a partial bypass of the fix applied in gh-aw#24154 (originating from the #1611 finding).
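To illustrate why the mention matcher misses the entity form, here is a minimal sketch. The regex is a simplified stand-in and an assumption, not the actual pattern used by neutralizeAllMentions; it only reproduces the property described above, namely that the character after @ must be in [A-Za-z0-9]. The names MENTION_SKETCH and findMentionsSketch are hypothetical.

```javascript
// Hypothetical, simplified mention pattern: "@" followed by an alphanumeric.
// The real neutralizeAllMentions pattern is not shown in this issue.
const MENTION_SKETCH = /@([A-Za-z0-9][A-Za-z0-9-]*)/g;

// Returns the list of handles the sketch pattern finds in the text.
function findMentionsSketch(text) {
  return Array.from(text.matchAll(MENTION_SKETCH), (m) => m[1]);
}

const plain = findMentionsSketch("@victim say hi");
const entity = findMentionsSketch("@&shy;victim say hi");

console.log(plain);  // [ 'victim' ]
console.log(entity); // []
```

Because the entity text begins with &, a matcher of this shape never sees a handle, while a renderer that decodes the entity later displays @victim.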

Affected Area

Safe-outputs output sanitization boundary: sanitizeContent / sanitizeContentCore → decodeHtmlEntities (line ~975) → hardenUnicodeText Step 3 stripping regex. Affects all safe-output types where body/text fields have "sanitize": true in validation.json.

Reproduction Outline

  1. Obtain sanitize_content_core.cjs at v0.68.3 (SHA 159c2fed045bdd850374b084fe92182c9e31b147237944f41aecd765d068e685).
  2. Run: node -e "const {sanitizeContentCore} = require('./sanitize_content_core.cjs'); console.log(sanitizeContentCore('@\u00ADvictim say hi'));"
    → Output includes `@victim`, i.e. neutralized (numeric/direct-char form is fixed).
  3. Run: node -e "const {sanitizeContentCore} = require('./sanitize_content_core.cjs'); console.log(sanitizeContentCore('@&shy;victim say hi'));"
    → Output is @&shy;victim say hi, unchanged: bypassed.
  4. Repeat step 3 substituting &zwj;, &zwnj;, &lrm;, &rlm;: all bypass.
  5. Confirm root cause: decodeHtmlEntities source lists no entries for &shy;, &zwj;, &zwnj;, &lrm;, &rlm;, &wj;, or &ZeroWidthSpace;.

Observed Behavior

sanitizeContentCore('@&shy;victim say hi') returns the input unchanged. The named entity &shy; is not decoded by decodeHtmlEntities, survives Step 3 stripping, and defeats neutralizeAllMentions because & is not in [A-Za-z0-9].

Expected Behavior

sanitizeContentCore('@&shy;victim') should return `@victim`: the named entity should be decoded to its Unicode code point (U+00AD) before Step 3 strips it, and the resulting bare @victim should be neutralized like any other mention.

Security Relevance

The @mention neutralization guarantee documented at the safe-outputs reference page is violated for named HTML entity forms of invisible characters. An adversarial issue or PR body that causes the AI to emit @&shy;maintainer (achievable via prompt injection) will pass through the sanitizer and, after GitHub renders it, may trigger a real notification to @maintainer. The bypass is achievable with any safe-output type that routes content through sanitizeContent.

Suggested Fix

Extend decodeHtmlEntities (after the &amp; block) to map named invisible-char entities to their Unicode code points before Step 3 runs:

result = result.replace(/&shy;/gi,            "\u00AD");  // soft hyphen
result = result.replace(/&zwj;/gi,            "\u200D");  // zero-width joiner
result = result.replace(/&zwnj;/gi,           "\u200C");  // zero-width non-joiner
result = result.replace(/&lrm;/gi,            "\u200E");  // left-to-right mark
result = result.replace(/&rlm;/gi,            "\u200F");  // right-to-left mark
result = result.replace(/&wj;/gi,             "\u2060");  // word joiner
result = result.replace(/&ZeroWidthSpace;/gi, "\u200B");  // zero-width space
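As one possible shape for this change, the same mapping could be expressed as a lookup table with a single replace pass. This is a sketch under assumptions, not the project's implementation: decodeInvisibleNamedEntities and stripInvisibleSketch are hypothetical names, and the stripping character class is a stand-in for hardenUnicodeText Step 3, whose real regex is not shown in this issue.

```javascript
// Entity names and code points follow the list in this issue.
// Keys are lowercased to mirror the case-insensitive /gi matching above.
const INVISIBLE_NAMED_ENTITIES = new Map([
  ["shy", "\u00AD"],            // soft hyphen
  ["zwj", "\u200D"],            // zero-width joiner
  ["zwnj", "\u200C"],           // zero-width non-joiner
  ["lrm", "\u200E"],            // left-to-right mark
  ["rlm", "\u200F"],            // right-to-left mark
  ["wj", "\u2060"],             // word joiner (as listed in this issue)
  ["zerowidthspace", "\u200B"], // zero-width space
]);

// Hypothetical helper: decodes only the named forms in the table and
// leaves every other &name; sequence untouched.
function decodeInvisibleNamedEntities(text) {
  return text.replace(/&([A-Za-z]+);/g, (whole, name) => {
    const decoded = INVISIBLE_NAMED_ENTITIES.get(name.toLowerCase());
    return decoded === undefined ? whole : decoded;
  });
}

// Stand-in for Step 3 stripping (assumed character class).
function stripInvisibleSketch(text) {
  return text.replace(/[\u00AD\u200B-\u200F\u2060]/g, "");
}

const decoded = decodeInvisibleNamedEntities(
  "@&shy;victim and @&ZeroWidthSpace;other &amp; more"
);
const stripped = stripInvisibleSketch(decoded);
console.log(stripped); // @victim and @other &amp; more
```

Keeping the names in one table may make it easier to extend the list if a broader audit turns up more invisible or confusable entities.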

Also add regression tests asserting that sanitizeContentCore('@&shy;victim') and sanitizeContentCore('@&lrm;victim') produce neutralized output. A broader audit of HTML5 named character references for additional invisible/confusable characters is also warranted.
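A regression check along these lines could be table-driven. The sketch below is hypothetical: findMentionBypasses and both toy sanitizers are illustrative names, and it assumes that neutralized output wraps the mention in backticks, as the expected output in this issue suggests. In a real test the sanitize argument would be sanitizeContentCore from sanitize_content_core.cjs.

```javascript
// Entity names taken from this issue.
const ENTITY_NAMES = ["shy", "zwj", "zwnj", "lrm", "rlm"];

// Returns the inputs whose sanitized output does not contain the
// backtick-wrapped (neutralized) mention.
function findMentionBypasses(sanitize, names = ENTITY_NAMES) {
  const failures = [];
  for (const name of names) {
    const input = `@&${name};victim`;
    const output = sanitize(input);
    if (!output.includes("`@victim`")) {
      failures.push({ input, output });
    }
  }
  return failures;
}

// Toy sanitizers for demonstration only; neither is the real sanitizer.
const identitySanitizer = (text) => text; // mimics the reported bypass
const fixedSanitizerSketch = (text) =>
  text
    .replace(/&(shy|zwj|zwnj|lrm|rlm);/gi, "")  // decode and strip in one step
    .replace(/@([A-Za-z0-9-]+)/g, "`@$1`");     // wrap the bare mention

console.log(findMentionBypasses(identitySanitizer).length);    // 5
console.log(findMentionBypasses(fixedSanitizerSketch).length); // 0
```

An empty result from findMentionBypasses is the passing condition; any returned entry records the input that slipped through and what the sanitizer produced.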

Additional Context

If the current named entity behavior is intentional (e.g., the sanitizer is not expected to handle HTML entity-encoded content), that assumption should be explicitly documented alongside the @mention neutralization guarantee in the safe-outputs reference, along with any upstream requirements that guarantee content arrives pre-decoded.

Original finding: https://github.com/githubnext/gh-aw-security/issues/2086


gh-aw version: v0.68.3

Generated by File Issue · ● 368.8K ·
