decodeHtmlEntities must decode invisible-char named entities (&shy;, &zwj;, &lrm;, etc.) before hardenUnicodeText Step 3 strippi [Content truncated due to length] #31702

Summary

decodeHtmlEntities in actions/setup/js/sanitize_content_core.cjs (v0.68.3) handles decimal (&#173;) and hex (&#xAD;) numeric entities for invisible/formatting characters, but does not handle their named entity forms (&shy;, &zwj;, &zwnj;, &lrm;, &rlm;). Because hardenUnicodeText Step 3 operates on actual Unicode code points, not on &name; string literals, the named entity forms survive intact. neutralizeAllMentions then fails to match @&shy;victim because the character after @ is & (not [A-Za-z0-9]), so the mention passes through unsanitized. When GitHub renders the output, the entity decodes to an invisible character and the result appears as @victim to readers. This is a partial bypass of the fix applied in gh-aw#24154 (originating from the #1611 finding).
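The ordering problem described above can be illustrated with a simplified model of the three stages. This is a sketch only: decodeNumericEntities, stripInvisible, neutralizeMentions, and sanitizeSketch are hypothetical stand-ins written for this illustration, not the gh-aw source, and the real regexes are likely broader.

```javascript
// Simplified stand-ins for the three stages; NOT the gh-aw implementation.

// Return the character for a code point, or the original text if out of range.
function codePointOrOriginal(value, original) {
  return value <= 0x10ffff ? String.fromCodePoint(value) : original;
}

// Stage 1: decode numeric entities only (hex and decimal), mirroring the
// behavior reported for v0.68.3. Named forms such as &shy; are left alone.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-f]+);/gi, (m, hex) => codePointOrOriginal(parseInt(hex, 16), m))
    .replace(/&#(\d+);/g, (m, dec) => codePointOrOriginal(parseInt(dec, 10), m));
}

// Stage 2: strip a few invisible code points (assumed subset of Step 3).
function stripInvisible(text) {
  return text.replace(/[\u00AD\u200B-\u200F\u2060\uFEFF]/g, "");
}

// Stage 3: wrap @name in backticks when @ is followed by [A-Za-z0-9].
function neutralizeMentions(text) {
  return text.replace(/@([A-Za-z0-9][A-Za-z0-9-]*)/g, "`@$1`");
}

function sanitizeSketch(text) {
  return neutralizeMentions(stripInvisible(decodeNumericEntities(text)));
}

console.log(sanitizeSketch("@&#173;victim say hi")); // `@victim` say hi
console.log(sanitizeSketch("@&shy;victim say hi"));  // @&shy;victim say hi (unchanged)
```

In this model the numeric form becomes a real U+00AD before stripping, so the mention regex sees @victim. The named form is still the literal text &shy; when the mention regex runs, so the character after @ is & and nothing matches.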

Affected Area

Safe-outputs output sanitization boundary — sanitizeContent / sanitizeContentCore → decodeHtmlEntities (line ~975) → hardenUnicodeText Step 3 stripping regex. Affects all safe-output types where body/text fields have "sanitize": true in validation.json.

Reproduction Outline

1. Obtain sanitize_content_core.cjs at v0.68.3 (SHA 159c2fed045bdd850374b084fe92182c9e31b147237944f41aecd765d068e685).
2. Run: node -e "const {sanitizeContentCore} = require('./sanitize_content_core.cjs'); console.log(sanitizeContentCore('@\u00ADvictim say hi'));"
   → Output includes `@victim`: neutralized (numeric/direct-char form is fixed).
3. Run: node -e "const {sanitizeContentCore} = require('./sanitize_content_core.cjs'); console.log(sanitizeContentCore('@&shy;victim say hi'));"
   → Output is @&shy;victim say hi, unchanged: bypassed.
4. Repeat step 3 substituting &zwj;, &zwnj;, &lrm;, &rlm;. All bypass.
5. Confirm root cause: decodeHtmlEntities source lists no entries for &shy;, &zwj;, &zwnj;, &lrm;, &rlm;, &wj;, or &ZeroWidthSpace;.

Observed Behavior

sanitizeContentCore('@&shy;victim say hi') returns the input unchanged. The named entity &shy; is not decoded by decodeHtmlEntities, survives Step 3 stripping, and defeats neutralizeAllMentions because & is not in [A-Za-z0-9].

Expected Behavior

sanitizeContentCore('@&shy;victim') should return `@victim`: the named entity should be decoded to its Unicode code point (U+00AD) before Step 3 strips it, and the resulting bare @victim should be neutralized like any other mention.

Security Relevance

The @mention neutralization guarantee documented at the safe-outputs reference page is violated for named HTML entity forms of invisible characters. An adversarial issue or PR body that causes the AI to emit @&shy;maintainer (achievable via prompt injection) will pass through the sanitizer and, after GitHub renders it, may trigger a real notification to @maintainer. The bypass is achievable with any safe-output type that routes content through sanitizeContent.

Suggested Fix

Extend decodeHtmlEntities (after the & block) to map named invisible-char entities to their Unicode code points before Step 3 runs:

result = result.replace(/&shy;/gi,            "\u00AD");  // soft hyphen
result = result.replace(/&zwj;/gi,            "\u200D");  // zero-width joiner
result = result.replace(/&zwnj;/gi,           "\u200C");  // zero-width non-joiner
result = result.replace(/&lrm;/gi,            "\u200E");  // left-to-right mark
result = result.replace(/&rlm;/gi,            "\u200F");  // right-to-left mark
result = result.replace(/&wj;/gi,             "\u2060");  // word joiner
result = result.replace(/&ZeroWidthSpace;/gi, "\u200B");  // zero-width space
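The same mapping could also be expressed as a lookup table with a single replace pass, which may be easier to extend after the broader audit. This is a sketch under assumptions: INVISIBLE_NAMED_ENTITIES and decodeInvisibleNamedEntities are hypothetical names, and the entity names are copied from the list above without being checked against the HTML named character reference table.

```javascript
// Hypothetical table-driven variant of the suggested fix; not the gh-aw source.
// Keys are lowercased because matching is case-insensitive, as in the /gi
// regexes above. Entity names are taken from the issue text as-is.
const INVISIBLE_NAMED_ENTITIES = new Map([
  ["shy", "\u00AD"],            // soft hyphen
  ["zwj", "\u200D"],            // zero-width joiner
  ["zwnj", "\u200C"],           // zero-width non-joiner
  ["lrm", "\u200E"],            // left-to-right mark
  ["rlm", "\u200F"],            // right-to-left mark
  ["wj", "\u2060"],             // word joiner
  ["zerowidthspace", "\u200B"], // zero-width space
]);

// Build one alternation such as /&(shy|zwj|...);/gi from the table keys.
const NAMED_ENTITY_PATTERN = new RegExp(
  `&(${[...INVISIBLE_NAMED_ENTITIES.keys()].join("|")});`,
  "gi"
);

// Replace each listed named entity with its Unicode character so that a
// later stripping step operates on real code points.
function decodeInvisibleNamedEntities(text) {
  return text.replace(NAMED_ENTITY_PATTERN, (_match, name) =>
    INVISIBLE_NAMED_ENTITIES.get(name.toLowerCase())
  );
}

console.log(decodeInvisibleNamedEntities("@&shy;victim") === "@\u00ADvictim"); // true
```

Adding an entity then means adding one table row rather than another replace call, at the cost of building the pattern dynamically.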

Also add regression tests asserting that sanitizeContentCore('@&shy;victim') and sanitizeContentCore('@&lrm;victim') produce neutralized output. A broader audit of HTML5 named character references for additional invisible/confusable characters is also warranted.

Additional Context

If the current named entity behavior is intentional (e.g., the sanitizer is not expected to handle HTML entity-encoded content), that assumption should be explicitly documented alongside the @mention neutralization guarantee in the safe-outputs reference, along with any upstream requirements that guarantee content arrives pre-decoded.

Original finding: https://github.com/githubnext/gh-aw-security/issues/2086

gh-aw version: v0.68.3

Generated by File Issue
