`<regex>`: Properly parse backslashes in character classes of basic regexes #5523

muellerj2 · 2025-05-18T12:52:30Z

This renames the parser to add a new member variable describing the lexer mode: Default or inside character class. This allows the lexer to correctly process a backslash when parsing a character class/bracket expression.
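For reference, POSIX specifies that a backslash loses its special meaning inside a bracket expression, so in a basic regex `[\(]` should match either a backslash or an opening parenthesis. A minimal sketch of that expectation through the public `std::regex` interface follows. The helper name `matches_basic` and the sample patterns are illustrative assumptions, not the PR's actual tests.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not part of the PR): full-match `subject`
// against `pattern` compiled as a POSIX basic regular expression.
bool matches_basic(const std::string& pattern, const std::string& subject) {
    const std::regex re(pattern, std::regex::basic);
    return std::regex_match(subject, re);
}

// Checks the POSIX rule that a backslash is an ordinary character
// inside a bracket expression, while \( \) still group outside of it.
void check_backslash_in_bracket() {
    // The bracket expression [\(] holds two ordinary characters: '\' and '('.
    assert(matches_basic("[\\(]", "\\"));
    assert(matches_basic("[\\(]", "("));
    assert(!matches_basic("[\\(]", "a"));

    // Outside brackets, \( and \) remain the BRE grouping escapes.
    assert(matches_basic("\\(a\\)", "a"));
}
```

Calling `check_backslash_in_bracket()` from a test driver is enough to exercise it. The asserts describe conforming behavior, so a standard library with the bug this PR addresses would be expected to trip the first one.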

I also tried to do this without renaming the parser, but then the lexer mode (inside or outside a character class) would have to be passed as an argument to every function that processes escapes in any way, which is a bit of a pain. Renaming the parser requires the fewest changes to the logic itself.
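The trade-off described above can be sketched with a toy lexer that keeps the mode as member state, so the escape check consults it without any change to its signature. Every name here (`ToyLexer`, `LexMode`, `lexes_as_escape`) is hypothetical, and the escape rule is deliberately reduced to `\(`, `\)`, `\{`, `\}`. It is not the STL's `_Parser2`.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// All names here are hypothetical; this is a toy, not the STL's _Parser2.
enum class LexMode : unsigned char { Default, CharClass };

class ToyLexer {
public:
    explicit ToyLexer(std::string pattern) : pat_(std::move(pattern)) {}

    void set_mode(LexMode mode) { mode_ = mode; }
    void seek(std::size_t pos) { pos_ = pos; }

    // True if the current position starts a BRE escape such as \( or \{.
    // The mode comes from member state, so the signature stays unchanged.
    bool is_escape() const {
        if (mode_ != LexMode::Default) {
            return false; // inside [...] a backslash is an ordinary character
        }
        if (pos_ + 1 >= pat_.size() || pat_[pos_] != '\\') {
            return false;
        }
        const char next = pat_[pos_ + 1];
        return next == '(' || next == ')' || next == '{' || next == '}';
    }

private:
    std::string pat_;
    std::size_t pos_ = 0;
    LexMode mode_ = LexMode::Default;
};

// Reports whether the character at `pos` is lexed as the start of an escape in `mode`.
bool lexes_as_escape(const std::string& pattern, std::size_t pos, LexMode mode) {
    ToyLexer lexer(pattern);
    lexer.seek(pos);
    lexer.set_mode(mode);
    return lexer.is_escape();
}
```

With this shape, `lexes_as_escape("[\\(]", 1, LexMode::CharClass)` is false while the same position in `LexMode::Default` is true. That is the distinction a mode-unaware lexer cannot make.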

Since the parser is renamed, this PR is also doing a number of minor cleanups to the parser and builder (which is also renamed to do these cleanups).

The PR is split into several commits to simplify reviewing:

1. Rename `_Parser` to `_Parser2`.
2. Since we have renamed the parser, we can strip any version numbers from member functions.
3. Clean up the parse flags, which we can do now because there is no longer any chance of mixing and matching the parser constructor and the parser member function `_Compile`. Specifically:
   - `_L_brk_bal` is assigned its own bit; previously it was (`STL/stl/inc/regex`, line 1879 in cbd091e):

     `_L_brk_bal = 0x20000000, // ']' special only after '[' (ERE, BRE); TRANSITION, ABI: same value as _L_brk_rstr`
   - The `_L_grp_esc` flag is added to the awk flags so that the workaround in `_ClassAtom` can be removed.
   - I also extended the `_Lang_flags` enum and the `_L_flags` member variable to `unsigned long long` so that we can add more flags more easily in the future. (This already adds `_L_dsh_rstr` to signify that the dash `-` cannot appear as the starting point of a character range in BREs and EREs, but doesn't perform the parser changes to support it yet.)
4. Remove the unused member `_Begin` from the parser.
5. Slightly reorder the parser member variables to reduce padding a bit. (`_Char` is usually a `char` or `wchar_t`, so it, plus the single-byte `_Mode` member variable added in the last commit, can usually fit into the four bytes the compiler must add after `_Mchar`.)
6. Rename `_Builder` to `_Builder2`.
7. Strip version numbers from member functions of the builder.
8. Remove obsolete members `_Bmax` and `_Tmax` from the builder.
9. Actually fix "`<regex>`: Backslashes in character classes are sometimes not matched in basic regular expressions" (#5379), essentially by making `_Is_esc()` always return false when not in default (read: outside-bracketed-character-class) mode. Note that it matters where we change the lexer mode in `_Parser2::_Alternative()`: `_Next()` and `_Expect()` process the first token inside or outside the square brackets, so we must change the mode before calling these functions. The tests check that we didn't get this wrong.
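The flag cleanup in the third commit can be pictured with a small sketch. Only the shared value `0x20000000` comes from the line quoted in the description. The enum names, the other bit positions, and the `has_flag` helper are hypothetical and do not reflect the actual numbering in `stl/inc/regex`.

```cpp
// Hypothetical names and bit positions; only 0x20000000 is taken from the PR text.
enum OldFlags : unsigned int {
    old_brk_rstr = 0x20000000,
    old_brk_bal  = 0x20000000, // same value: the two rules cannot be told apart
};

// A 64-bit underlying type leaves room for flags beyond bit 31.
enum NewFlags : unsigned long long {
    new_brk_rstr = 1ULL << 29,
    new_brk_bal  = 1ULL << 30, // now has a bit of its own
    new_dsh_rstr = 1ULL << 32, // would not fit into a 32-bit flag set
};

// True if any bit of `bit` is set in `flags`.
constexpr bool has_flag(unsigned long long flags, unsigned long long bit) {
    return (flags & bit) != 0;
}

// With aliased values, enabling one flag silently enables the other.
static_assert(has_flag(old_brk_rstr, old_brk_bal), "old flags alias each other");

// With distinct bits, the flags are independent.
static_assert(!has_flag(new_brk_rstr, new_brk_bal), "new flags are independent");
static_assert(new_dsh_rstr > 0xFFFFFFFFULL, "flag lies above the low 32 bits");
```

The `static_assert`s make the point at compile time: a grammar that sets only one of two aliased flags cannot be distinguished from one that sets both, whereas distinct bits can be tested separately.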

StephanTLavavej · 2025-05-22T11:52:17Z

Reviewing now, I'll push changes soon. I updated the PR description from saying "_L_paren_bal is assigned its own bit." to say _L_brk_bal instead; please meow if I was somehow confused.

Use muellerj2's superior descriptions of `_L_alt_nl` and `_L_no_nl`. Note that `_L_no_nl` is (grep, egrep). Note that `_L_esc_oct` and `_L_esc_ffnx` are (awk). `_L_esc_ffn` confusingly said "(\[fnrtv])" when other comments like `_L_ident_ERE` mean square brackets literally. Spell out "(\f \n \r \t \v)" for clarity and improved searchability. Rephrase `_L_ident_awk`'s comment for clarity. Note that `_L_anch_rstr` is (BRE) only, `_L_paren_bal` is (ERE) only, and `_L_brk_rstr` is (ERE, BRE).

StephanTLavavej · 2025-05-22T13:56:50Z

Thanks!! 😻 I pushed some follow-up commits for additional cleanups, please double-check.

I really appreciate the well-structured commit history here; ordinarily I would be nervous about mixing a refactoring and a bugfix but this was entirely reasonable.

StephanTLavavej · 2025-05-22T14:23:22Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-05-22T20:03:49Z

Thanks for the infinite bugfixes in infinite combinations, as the Vulcans say! 🖖 🐞 🛠️

muellerj2 added 9 commits May 18, 2025 13:17

rename _Parser to _Parser2

6130bce

remove unnecessary version numbers from _Parser2's member functions

bcdf787

clean up parse flags, extend range of possible parse flags, add new p…

d93567b

…arse flag for forbidden dash at range start

remove unused member _Begin from _Parser2

a687fb0

reorder _Parser2 members to reduce padding

8adea0f

rename _Builder to _Builder2

3cefca7

remove version numbers from member functions of _Builder2

1c0c01d

remove obsolete member variables from _Builder2

7d9748a

do not escape parentheses and braces in bracket expressions in basic …

5fbd5b1

…or grep mode

muellerj2 requested a review from a team as a code owner May 18, 2025 12:52

github-project-automation bot added this to STL Code Reviews May 18, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews May 18, 2025

StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels May 18, 2025

StephanTLavavej reviewed May 18, 2025

StephanTLavavej self-assigned this May 18, 2025

StephanTLavavej added 7 commits May 22, 2025 03:14

Rename _Builder2::_Add_char2 back to _Add_char.

ccc071b

Drop "ABI zombie name" comment.

8a92b64

Delete copy ctors for _Builder2 and _Matcher2.

b6651d8

_Builder2::_Add_nop was unused.

e32b28b

Avoid shadowing in _Builder2: _Negate => _Negative

260cbce

_Parser2::char_class_type was unused.

43c3235

_Parser2::_Mark_count can be private.

50204cb

StephanTLavavej added 5 commits May 22, 2025 05:06

Use a data member initializer for _Lex_mode _Mode.

4fba27c

Meld _L_nc_grp and _L_asrt_gen into _L_nc_asrt, improve comments.

5e0151d

Renumber _Lang_flags2.

cfaf2df

Rename to _L_non_greedy.

e9eb1c6

StephanTLavavej reviewed May 22, 2025

StephanTLavavej approved these changes May 22, 2025

StephanTLavavej removed their assignment May 22, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 22, 2025

StephanTLavavej mentioned this pull request May 22, 2025

Maintainer priorities #4700

Open

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 22, 2025

StephanTLavavej merged commit 6e8a91f into microsoft:main May 22, 2025
40 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews May 22, 2025

muellerj2 deleted the regex-bre-fix-backslashes-in-char-classes branch May 31, 2025 21:45
