Skip to content

SRE ignores the ASCII flag on character ranges with non-BMP upper bound #126505

Closed
@djoooooe

Description

@djoooooe

Bug report

Bug description:

It seems like SRE ignores the ASCII flag when parsing a character range whose upper bound is beyond the BMP region:

>>> import re

# should match
>>> regex = re.compile("[\ua7aa-\uffff]", re.IGNORECASE)
>>> print(regex.match("\u0266"))
<re.Match object; span=(0, 1), match='ɦ'> 

# should not match
>>> regex = re.compile("[\ua7aa-\U00010000]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
<re.Match object; span=(0, 1), match='ɦ'>

# must be related to case folding, since \ua7aa folds to \u0266
>>> regex = re.compile("[\ua7ab-\U00010000]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
None

# correct behavior when upper bound is in BMP
>>> regex = re.compile("[\ua7aa-\uffff]", re.ASCII | re.IGNORECASE)
>>> print(regex.match("\u0266"))
None

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Linked PRs

Metadata

Metadata

Labels

3.12only security fixes3.13bugs and security fixes3.14bugs and security fixesextension-modulesC modules in the Modules dirtopic-regextype-bugAn unexpected behavior, bug, or error

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions