Skip to content

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented May 18, 2025

Fixes #5379.

This renames the parser to add a new member variable describing the lexer mode: Default or inside character class. This allows the lexer to correctly process a backslash when parsing a character class/bracket expression.

I also tried to do this without renaming the parser, but this would mean we would have to pass the lexer mode (in or outside a character class) as an argument to all the functions processing escapes in any way, which is a bit of a pain. By renaming the parser, we need the least changes to the logic itself.

Since the parser is renamed, this PR is also doing a number of minor cleanups to the parser and builder (which is also renamed to do these cleanups).

The PR is split into several commits to simplify reviewing:

  • Rename _Parser to _Parser2.
  • Since we have renamed the parser, we can strip any version numbers from member functions.
  • Clean up the parse flags, which we can do now because there is no longer any chance of mix-and-matching the parser constructor and the parser member function _Compile. Specifically:
    • _L_brk_bal is assigned its own bit; previously it was

      STL/stl/inc/regex

      Line 1879 in cbd091e

      _L_brk_bal = 0x20000000, // ']' special only after '[' (ERE, BRE); TRANSITION, ABI: same value as _L_brk_rstr
    • The _L_grp_esc flag is added to the awk flags so that the workaround in _ClassAtom can be removed.
    • I also extended the _Lang_flags enum and the _L_flags member variable to unsigned long long so that we can add more flags more easily in the future. (This already adds _L_dsh_rstr to signify that the dash - cannot appear as the starting point of a character range in BREs and EREs, but doesn't perform the parser changes to support it yet.)
  • Remove the unused member _Begin from the parser.
  • Slightly reorder the parser member variables to reduce padding a bit. (_Char is usually a char or wchar_t, so it [plus the single-byte _Mode member variable added in the last commit] can usually fit into the four bytes the compiler must add after _Mchar.
  • Rename _Builder to _Builder2.
  • Strip version numbers from member functions of the builder.
  • Remove obsolete members _Bmax and _Tmax from the builder.
  • Actually fix <regex>: Backslashes in character classes are sometimes not matched in basic regular expressions #5379 essentially by making _Is_esc() always return false when not in default (read: outside-bracketed-character-class) mode. Note that it matters how we change the lexer mode in _Parser2::_Alternative(): _Next() and _Expect() process the first token inside or outside the square brackets, so we must change the mode before calling these functions. The tests check that we didn't get this wrong.

@muellerj2 muellerj2 requested a review from a team as a code owner May 18, 2025 12:52
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 18, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels May 18, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 18, 2025
@StephanTLavavej
Copy link
Member

Reviewing now, I'll push changes soon. I updated the PR description from saying "_L_paren_bal is assigned its own bit." to say _L_brk_bal instead; please meow if I was somehow confused.

Use muellerj2's superior descriptions of `_L_alt_nl` and `_L_no_nl`.

Note that `_L_no_nl` is (grep, egrep).

Note that `_L_esc_oct` and `_L_esc_ffnx` are (awk).

`_L_esc_ffn` confusingly said "(\[fnrtv])" when other comments like `_L_ident_ERE` mean square brackets literally. Spell out "(\f \n \r \t \v)" for clarity and improved searchability.

Rephrase `_L_ident_awk`'s comment for clarity.

Note that `_L_anch_rstr` is (BRE) only, `_L_paren_bal` is (ERE) only, and `_L_brk_rstr` is (ERE, BRE).
@StephanTLavavej
Copy link
Member

Thanks!! 😻 I pushed some follow-up commits for additional cleanups, please double-check.

I really appreciate the well-structured commit history here; ordinarily I would be nervous about mixing a refactoring and a bugfix but this was entirely reasonable.

@StephanTLavavej StephanTLavavej removed their assignment May 22, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 22, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 22, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 6e8a91f into microsoft:main May 22, 2025
40 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 22, 2025
@StephanTLavavej
Copy link
Member

Thanks for the infinite bugfixes in infinite combinations, as the Vulcans say! 🖖 🐞 🛠️

@muellerj2 muellerj2 deleted the regex-bre-fix-backslashes-in-char-classes branch May 31, 2025 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working regex meow is a substring of homeowner

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

<regex>: Backslashes in character classes are sometimes not matched in basic regular expressions

2 participants