`<regex>`: Reject empty repetitions when required by regex grammars #5494

muellerj2 · 2025-05-11T18:19:28Z

Standards

ECMA-262 15.10.2.5 "Term" specifies in the internal continuation of the internal helper function "RepeatMatcher" for production Term :: Atom Quantifier:

If min is zero and y's endIndex is equal to x's endIndex, then return failure.

Here, min is decremented for each successful repeat match until it reaches zero, so the match is supposed to fail if an additional repetition yields an empty match and the specified minimum number of repetititions have been reached.

Similary, POSIX says (here for basic regexes, but the standard contains similar language for extended regexes):

A subexpression repeated by an ('*') or an interval expression shall not match a null expression unless this is the only match for the repetition or it is necessary to satisfy the exact or minimum number of occurrences for the interval expression.

Note the subtle difference that POSIX allows an empty match if there is only one repetition.

Changes

`_Matcher::_Do_rep()`

The logic has been restructured:

First, we check if the minimum number of repetitions have been reached. If not and there is no prior match or the last match wasn't empty (_Progress is true), we do another repetition. If the last match was empty, then we can speed this up by doing only one more match, because the repetitions are internally independent, so the difference can't be observed if we reorder when empty matches appear when repeating the loop. (This is the same change I already mentioned in <regex>: Fix depth-first and leftmost-longest matching rules #5218 (comment), but I wasn't aware at that time that ECMAScript requires this change as well.)
If the minimum has just been reached, we have yet to perform a single repetition, or the last match made progress, we try to do another repetition according to maximum repetition limits, the matching mode (longest, depth-first) and the greediness of the quantifier/loop.
If the last match is empty, we accept in POSIX mode if this is the only repetition.
Otherwise, we reject (_Matched0 stays false).

`_Matcher::_Do_rep0()`

The logic changes are similar, but one difference is exploited: Since the repeated subexpression must be branchless, all repetitions must either be non-empty or empty. Therefore, we only have to check at the first repetition if the match is non-empty, and we also don't have to try one last match to achieve minimum requirements, because the match will be empty again anyway.

However, the logic changes can't be observed with current NFAs, because capture groups currently can't appear in loops that the simple loop optimization is enabled for. This is because the parser generates unnecessary _N_if nodes.

By applying the following diff, these unnecessary nodes are no longer generated and we can check that the tests still pass when the updated logic for _Do_rep0() is actually exerted:

diff --git a/stl/inc/regex b/stl/inc/regex
index 89e8a509..85adbb41 100644
--- a/stl/inc/regex
+++ b/stl/inc/regex
@@ -4913,15 +4913,17 @@ void _Parser<_FwdIt, _Elem, _RxTraits>::_Disjunction() { // check for valid disj
         _Nfa._End_group(_Pos3);
     }
 
-    _Node_base* _Pos2 = _Nfa._Begin_if(_Pos1);
-    while (_Mchar == _Meta_bar) { // append terms as long as we keep finding | characters
-        _Next();
-        if (!_Alternative()) { // zero-length trailing alternative
-            _Node_base* _Pos3 = _Nfa._Begin_group();
-            _Nfa._End_group(_Pos3);
-        }
+    if (_Mchar == _Meta_bar) { // append terms as long as we keep finding | characters
+        _Node_base* _Pos2 = _Nfa._Begin_if(_Pos1);
+        do {
+            _Next();
+            if (!_Alternative()) { // zero-length trailing alternative
+                _Node_base* _Pos3 = _Nfa._Begin_group();
+                _Nfa._End_group(_Pos3);
+            }
 
-        _Nfa._Else_if(_Pos1, _Pos2);
+            _Nfa._Else_if(_Pos1, _Pos2);
+        } while (_Mchar == _Meta_bar);
     }
 }

(I haven't proposed this change yet because I'm not sure yet what to do about its side effects and what is most appropriate for a non-recursive matcher, but I do want to make sure that this change doesn't become impossible due to some erroneous change in _Do_rep0().)

`_Builder::_Add_rep2()`

In this function, quantifiers ?, ??, {0,1} and {0,1}? were not represented using _N_rep nodes, but instead emitted as two branches of an _N_if node. But this will accept an empty match, so if (a?)? is matched to the empty string, the capture group will be matched to the empty string as well. However, it should actually be unmatched according to the ECMAScript rule.

For this reason, this PR disables this alternative representation when the subexpression contains a capture group or a non-capturing group (for simplicity, because otherwise we would have to keep searching in the non-capturing group for capture groups).

`_Parser::_Calculate_loop_simplicity()`

Disabling the alternative representation would have the consequence that some loops would no longer be marked as simple. This is undesirable because it can make regexes containing them newly susceptible to stack overflow.

To avoid this, the loop simplicity calculation now also marks loops as simple if they are branchless and all outer loops can be repeated at most once (so in particular, if they were all generated by ?quantifiers).

Note that we must mark a loop is non-simple if some outer loop can be repeated at least twice (and contains a capture group somewhere) because _Do_rep0() does not reset the positions of capture groups when recursive _Match_pat() calls return. If an outer loop were repeated twice, i.e., another repetition of the outer loop could happen in one of _Do_rep0()'s _Match_pat(), the positions of capture groups in the outer loop would change. Since _Do_rep0() does not reset the capture groups, the starting positions of any of these capture groups would be corrupted during the remaining processing in _Do_rep0().

Miscellaneous

I centralized the definition of _Any_posix because there was a need for it in _Do_rep() and _Do_rep0() now as well.
_Do_rep() and _Do_rep0() no longer reset _Tgt_state when they didn't match. The correctness argument for these changes is the same as for the corresponding change to _Do_if() in <regex>: Simplify matching of _N_if NFA nodes in leftmost-longest mode #5405.

…rammars

stl/inc/regex

tests/std/tests/VSO_0000000_regex_use/test.cpp

StephanTLavavej · 2025-05-15T11:40:28Z

Thanks! 😻 I expanded the test coverage slightly and fixed a couple of test bugs.

StephanTLavavej · 2025-05-16T16:03:26Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-05-17T04:26:43Z

🪹 🪫 🔅

<regex>: Reject to match empty repetitions when required by regex g…

5f4fc39

…rammars

muellerj2 requested a review from a team as a code owner May 11, 2025 18:19

github-project-automation bot added this to STL Code Reviews May 11, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews May 11, 2025

muellerj2 changed the title ~~<regex>: Reject to match empty repetitions when required by regex grammars~~ <regex>: Reject empty repetitions when required by regex grammars May 11, 2025

StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels May 11, 2025

StephanTLavavej self-assigned this May 11, 2025

StephanTLavavej added 5 commits May 15, 2025 03:56

Fix comment typos, drop extra newline.

7476265

I see dead patterns. They don't even know they're dead.

be3c90d

Add test coverage.

a1830f9

Fix ? usage in BREs, add positive test cases.

7266d93

Reorder test cases.

bcf3a79

StephanTLavavej reviewed May 15, 2025

View reviewed changes

StephanTLavavej approved these changes May 15, 2025

View reviewed changes

StephanTLavavej removed their assignment May 15, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 15, 2025

StephanTLavavej mentioned this pull request May 15, 2025

Maintainer priorities #4700

Open

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 16, 2025

StephanTLavavej merged commit 0992807 into microsoft:main May 17, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews May 17, 2025

StephanTLavavej mentioned this pull request May 17, 2025

<regex>: Improve search performance for regexes with initial + quantifiers #5509

Merged

muellerj2 deleted the regex-fix-empty-repetitions branch May 31, 2025 21:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`<regex>`: Reject empty repetitions when required by regex grammars #5494

`<regex>`: Reject empty repetitions when required by regex grammars #5494

Uh oh!

muellerj2 commented May 11, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

StephanTLavavej commented May 15, 2025

Uh oh!

StephanTLavavej commented May 16, 2025

Uh oh!

Uh oh!

StephanTLavavej commented May 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

<regex>: Reject empty repetitions when required by regex grammars #5494

<regex>: Reject empty repetitions when required by regex grammars #5494

Uh oh!

Conversation

muellerj2 commented May 11, 2025

Standards

Changes

_Matcher::_Do_rep()

_Matcher::_Do_rep0()

_Builder::_Add_rep2()

_Parser::_Calculate_loop_simplicity()

Miscellaneous

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

StephanTLavavej commented May 15, 2025

Uh oh!

StephanTLavavej commented May 16, 2025

Uh oh!

Uh oh!

StephanTLavavej commented May 17, 2025

🪹 🪫 🔅

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`<regex>`: Reject empty repetitions when required by regex grammars #5494

`<regex>`: Reject empty repetitions when required by regex grammars #5494

`_Matcher::_Do_rep()`

`_Matcher::_Do_rep0()`

`_Builder::_Add_rep2()`

`_Parser::_Calculate_loop_simplicity()`