Skip to content

regex::bytes::Regex::is_match with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

Open
@expenses

Description

@expenses

What version of regex are you using?

v1.10.2

Describe the bug at a high level.

I'm trying to match 2D patterns inside 2D flattened array of u8s. For example, to match

0
1

on a 1024x1024 array I'm this pattern: "\\u{0}.{1023}\\u{1}". I'm compiling with

regex::bytes::RegexBuilder::new(&string)
    .unicode(false)
    .dot_matches_new_line(true)
    .build()

My expectation is that this kind of pattern would be at-worst slightly slower than the naïve alternative that follows:

enum RegexOp {
    Match(Vec<u8>),
    Skip(usize)
}

fn regex_match(ops: &[RegexOp], mut slice: &[u8]) -> bool {
    for op in ops {
        match op {
            RegexOp::Skip(skip) => slice = &slice[*skip..],
            RegexOp::Match(m) => {
                if !slice.starts_with(m) {
                    return false;
                }
                slice = &slice[m.len()..];
            }
        }
    }

    true
}

In this case ops looks like [Match([1]), Skip(1023), Match([0])].

Let me know if this is too much to ask.

Currently the regex solution takes 27.70 seconds for a 256x256 workload according to time:

________________________________________________________
Executed in   27.70 secs    fish           external
   usr time   27.65 secs  827.00 micros   27.65 secs
   sys time    0.60 secs   60.00 micros    0.60 secs

Whereas a 2-line change from is_match to my regex_match function drops this down to 147.17 ms (or 517ms if you're taking the longer value, doesn't really matter).

________________________________________________________
Executed in  147.17 millis    fish           external
   usr time  154.46 millis  663.00 micros  153.80 millis
   sys time  517.86 millis  183.00 micros  517.68 millis

I can provide proper benchmarks if you want. I am of-course using release mode and the standard set of regex crate features.

Please let me know if I've made some mistake in RegexBuilder or if a seperate set of feature flags might improve things.

What are the steps to reproduce the behavior?

I'm happy to write a test program if that's required for this issue to get attention, but I believe it should be trivial to write one given what I've written above.

What is the actual behavior?

N/A

What is the expected behavior?

N/A

What do you expect the output to be?

N/A

Related: #590 (comment).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions