Regex performance regression induced by match-all code
I found a nasty performance problem in commit 824bf7190: given the
right sort of regex, checkmatchall() takes an unreasonable amount
of time. For example,
regression=# SELECT regexp_matches('', '(.|){20}','');
regexp_matches
----------------
{""}
(1 row)
Time: 129.213 ms
regression=# SELECT regexp_matches('', '(.|){25}','');
regexp_matches
----------------
{""}
(1 row)
Time: 4101.416 ms (00:04.101)
regression=# SELECT regexp_matches('', '(.|){30}','');
regexp_matches
----------------
{""}
(1 row)
Time: 130803.927 ms (02:10.804)
That's quite awful compared to v13, where these cases take
only a couple ms.
Worse still, you can't get out of it with control-C, because
checkmatchall_recurse lacks any check-for-interrupt.
The problem here is basically that we're willing to recursively
examine all paths out of the same NFA state over and over, once for
each possible way of arriving at that state. That's dumb and we can
do better, though it takes a little more code and some more memory.
The attached patch applies a few different methods to make this
better:
* Before starting the recursive search, do a linear-time pass
through all the states to check whether there are any non-RAINBOW
arcs. This allows failing fast for most non-matchall NFAs.
* Memo-ize the results of the recursive search, by storing an
array of possible path lengths for each state after we've examined
it once.
* Rewrite the checks for pseudocolor arcs to make them linear
time rather than O(N^2) --- I decided I'd better do that too,
after noting that the problem cases had fairly large numbers
of pre-state outarcs. This makes them cheap enough to do
before the recursive search not after.
* Put a heuristic upper bound on the number of NFA states for
which we'll attempt this optimization at all. The main reason
for this is to bound the amount of memory we can expend on
memoization results. I think that it will result in little
if any degradation in our ability to recognize matchall NFAs,
because of the existing restriction that we can't represent
cases involving path lengths that are finite but more than DUPINF.
If there are a lot more than DUPINF states then (I think) it becomes
pretty likely that we'd fail due to that restriction anyhow.
However, I've not made much attempt to quantify that argument;
I just chose DUPINF * 4 out of the air.
* Just in case that's not enough to fix things, add a cancel check
within checkmatchall_recurse.
The main thing I find a bit ugly about this solution is that
I'm abusing the state->tmp fields (which are declared struct state *)
to hold the memoization results (which are "bool *"). It'd be
possible to avoid that by allocating an additional "bool **" array
indexed by state number, but whether it's worth that depends on how
allergic you are to weird pointer casts.
Comments?
regards, tom lane
Attachments:
fix-exponential-cost-of-checkmatchall-1.patchtext/x-diff; charset=us-ascii; name=fix-exponential-cost-of-checkmatchall-1.patchDownload+297-143
On Sat, May 1, 2021, at 21:46, Tom Lane wrote:
I found a nasty performance problem in commit 824bf7190: given the
right sort of regex, checkmatchall() takes an unreasonable amount
of time.
Nice catch.
fix-exponential-cost-of-checkmatchall-1.patch
I've successfully tested the patch on the regex corpus:
SELECT
is_match <> (subject ~ pattern),
captured IS DISTINCT FROM regexp_match(subject, pattern, flags),
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2;
?column? | ?column? | count
----------+----------+---------
f | f | 3253889
(1 row)
HEAD (651d005e76bc0b9542615f609b4d0d946035dc58)
Time: 94096.020 ms (01:34.096)
Time: 93102.287 ms (01:33.102)
Time: 93333.746 ms (01:33.334)
fix-exponential-cost-of-checkmatchall-1.patch
Time: 95247.529 ms (01:35.248)
Time: 92617.502 ms (01:32.618)
Time: 93259.700 ms (01:33.260)
I've also tested the problematic type of regexes:
HEAD (651d005e76bc0b9542615f609b4d0d946035dc58)
SELECT regexp_matches('', '(.|){20}','');
Time: 61.613 ms
SELECT regexp_matches('', '(.|){25}','');
Time: 1928.674 ms (00:01.929)
SELECT regexp_matches('', '(.|){27}','');
Time: 7789.601 ms (00:07.790)
fix-exponential-cost-of-checkmatchall-1.patch
SELECT regexp_matches('', '(.|){20}','');
Time: 0.965 ms
SELECT regexp_matches('', '(.|){25}','');
Time: 0.586 ms
SELECT regexp_matches('', '(.|){27}','');
Time: 0.788 ms
Nice improvement, thanks.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
On Sat, May 1, 2021, at 21:46, Tom Lane wrote:
I found a nasty performance problem in commit 824bf7190: given the
right sort of regex, checkmatchall() takes an unreasonable amount
of time.
Nice catch.
I've successfully tested the patch on the regex corpus:
Thanks for testing!
The main thing I find a bit ugly about this solution is that
I'm abusing the state->tmp fields (which are declared struct state *)
to hold the memoization results (which are "bool *"). It'd be
possible to avoid that by allocating an additional "bool **" array
indexed by state number, but whether it's worth that depends on how
allergic you are to weird pointer casts.
I tried rewriting it like that, and I have to say I do like it better
that way. I think it's clearer, which seems well worth one extra
malloc.
Also, I poked a little more at the question of the heuristic limit
on number of states, by checking the actual numbers of states in
various ways of writing matchall regexes. It looks to me like
we can cut the limit to DUPINF*2 and still have lots of daylight,
because reasonable (and even not so reasonable) ways to write a
pattern that can match up to K characters seem to come out with
not much more than K states.
Hence, PFA v2. I also added a couple of test cases based on
looking at code coverage in this area, as well as a case that
takes an unreasonably long time without this fix.
regards, tom lane
Attachments:
fix-exponential-cost-of-checkmatchall-2.patchtext/x-diff; charset=us-ascii; name=fix-exponential-cost-of-checkmatchall-2.patchDownload+366-163
On Sun, May 2, 2021, at 18:53, Tom Lane wrote:
fix-exponential-cost-of-checkmatchall-2.patch
Successfully tested.
SELECT
is_match <> (subject ~ pattern),
captured IS DISTINCT FROM regexp_match(subject, pattern, flags),
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2;
?column? | ?column? | count
----------+----------+---------
f | f | 3253889
(1 row)
Time: 94149.542 ms (01:34.150)
Time: 91565.305 ms (01:31.565)
Time: 91565.305 ms (01:31.565)
SELECT regexp_matches('', '(.|){20}','');
regexp_matches
----------------
{""}
(1 row)
Time: 0.541 ms
SELECT regexp_matches('', '(.|){25}','');
regexp_matches
----------------
{""}
(1 row)
Time: 0.724 ms
SELECT regexp_matches('', '(.|){27}','');
regexp_matches
----------------
{""}
(1 row)
Time: 0.782 ms
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
On Sun, May 2, 2021, at 18:53, Tom Lane wrote:
fix-exponential-cost-of-checkmatchall-2.patch
Successfully tested.
Again, thanks for checking!
regards, tom lane
On Mon, May 3, 2021, at 21:38, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org <mailto:joel%40compiler.org>> writes:
On Sun, May 2, 2021, at 18:53, Tom Lane wrote:
fix-exponential-cost-of-checkmatchall-2.patch
Successfully tested.
Again, thanks for checking!
You're welcome, thanks for coding!
/Joel