Skip to content

Conversation

muellerj2
Copy link
Contributor

Fixes #5490.

Standards

ECMA-262 15.10.2.5 "Term" specifies in the internal continuation of the internal helper function "RepeatMatcher" for production Term :: Atom Quantifier:

If min is zero and y's endIndex is equal to x's endIndex, then return failure.

Here, min is decremented for each successful repeat match until it reaches zero, so the match is supposed to fail if an additional repetition yields an empty match and the specified minimum number of repetititions have been reached.

Similary, POSIX says (here for basic regexes, but the standard contains similar language for extended regexes):

A subexpression repeated by an ('*') or an interval expression shall not match a null expression unless this is the only match for the repetition or it is necessary to satisfy the exact or minimum number of occurrences for the interval expression.

Note the subtle difference that POSIX allows an empty match if there is only one repetition.

Changes

_Matcher::_Do_rep()

The logic has been restructured:

  • First, we check if the minimum number of repetitions have been reached. If not and there is no prior match or the last match wasn't empty (_Progress is true), we do another repetition. If the last match was empty, then we can speed this up by doing only one more match, because the repetitions are internally independent, so the difference can't be observed if we reorder when empty matches appear when repeating the loop. (This is the same change I already mentioned in <regex>: Fix depth-first and leftmost-longest matching rules #5218 (comment), but I wasn't aware at that time that ECMAScript requires this change as well.)
  • If the minimum has just been reached, we have yet to perform a single repetition, or the last match made progress, we try to do another repetition according to maximum repetition limits, the matching mode (longest, depth-first) and the greediness of the quantifier/loop.
  • If the last match is empty, we accept in POSIX mode if this is the only repetition.
  • Otherwise, we reject (_Matched0 stays false).

_Matcher::_Do_rep0()

The logic changes are similar, but one difference is exploited: Since the repeated subexpression must be branchless, all repetitions must either be non-empty or empty. Therefore, we only have to check at the first repetition if the match is non-empty, and we also don't have to try one last match to achieve minimum requirements, because the match will be empty again anyway.

However, the logic changes can't be observed with current NFAs, because capture groups currently can't appear in loops that the simple loop optimization is enabled for. This is because the parser generates unnecessary _N_if nodes.

By applying the following diff, these unnecessary nodes are no longer generated and we can check that the tests still pass when the updated logic for _Do_rep0() is actually exerted:

diff --git a/stl/inc/regex b/stl/inc/regex
index 89e8a509..85adbb41 100644
--- a/stl/inc/regex
+++ b/stl/inc/regex
@@ -4913,15 +4913,17 @@ void _Parser<_FwdIt, _Elem, _RxTraits>::_Disjunction() { // check for valid disj
         _Nfa._End_group(_Pos3);
     }
 
-    _Node_base* _Pos2 = _Nfa._Begin_if(_Pos1);
-    while (_Mchar == _Meta_bar) { // append terms as long as we keep finding | characters
-        _Next();
-        if (!_Alternative()) { // zero-length trailing alternative
-            _Node_base* _Pos3 = _Nfa._Begin_group();
-            _Nfa._End_group(_Pos3);
-        }
+    if (_Mchar == _Meta_bar) { // append terms as long as we keep finding | characters
+        _Node_base* _Pos2 = _Nfa._Begin_if(_Pos1);
+        do {
+            _Next();
+            if (!_Alternative()) { // zero-length trailing alternative
+                _Node_base* _Pos3 = _Nfa._Begin_group();
+                _Nfa._End_group(_Pos3);
+            }
 
-        _Nfa._Else_if(_Pos1, _Pos2);
+            _Nfa._Else_if(_Pos1, _Pos2);
+        } while (_Mchar == _Meta_bar);
     }
 }
 

(I haven't proposed this change yet because I'm not sure yet what to do about its side effects and what is most appropriate for a non-recursive matcher, but I do want to make sure that this change doesn't become impossible due to some erroneous change in _Do_rep0().)

_Builder::_Add_rep2()

In this function, quantifiers ?, ??, {0,1} and {0,1}? were not represented using _N_rep nodes, but instead emitted as two branches of an _N_if node. But this will accept an empty match, so if (a?)? is matched to the empty string, the capture group will be matched to the empty string as well. However, it should actually be unmatched according to the ECMAScript rule.

For this reason, this PR disables this alternative representation when the subexpression contains a capture group or a non-capturing group (for simplicity, because otherwise we would have to keep searching in the non-capturing group for capture groups).

_Parser::_Calculate_loop_simplicity()

Disabling the alternative representation would have the consequence that some loops would no longer be marked as simple. This is undesirable because it can make regexes containing them newly susceptible to stack overflow.

To avoid this, the loop simplicity calculation now also marks loops as simple if they are branchless and all outer loops can be repeated at most once (so in particular, if they were all generated by ?quantifiers).

Note that we must mark a loop is non-simple if some outer loop can be repeated at least twice (and contains a capture group somewhere) because _Do_rep0() does not reset the positions of capture groups when recursive _Match_pat() calls return. If an outer loop were repeated twice, i.e., another repetition of the outer loop could happen in one of _Do_rep0()'s _Match_pat(), the positions of capture groups in the outer loop would change. Since _Do_rep0() does not reset the capture groups, the starting positions of any of these capture groups would be corrupted during the remaining processing in _Do_rep0().

Miscellaneous

@muellerj2 muellerj2 requested a review from a team as a code owner May 11, 2025 18:19
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 11, 2025
@muellerj2 muellerj2 changed the title <regex>: Reject to match empty repetitions when required by regex grammars <regex>: Reject empty repetitions when required by regex grammars May 11, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels May 11, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 11, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 I expanded the test coverage slightly and fixed a couple of test bugs.

@StephanTLavavej StephanTLavavej removed their assignment May 15, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 15, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 16, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 0992807 into microsoft:main May 17, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 17, 2025
@StephanTLavavej
Copy link
Member

🪹 🪫 🔅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working regex meow is a substring of homeowner

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

<regex>: Optional empty repetitions are illegal

2 participants