Some regular-expression performance hacking
As I mentioned in connection with adding the src/test/modules/test_regex
test code, I've been fooling with some performance improvements to our
regular expression engine. Here's the first fruits of that labor.
This is mostly concerned with cutting the overhead for handling trivial
unconstrained patterns like ".*".
0001 creates the concept of a "rainbow" arc within regex NFAs. You can
read background info about this in the "Colors and colormapping" part of
regex/README, but the basic point is that right now, representing a dot
(".", match anything) within an NFA requires a separate arc for each
"color" (character equivalence class) that the regex needs. This uses
up a fair amount of storage and processing effort, especially in larger
regexes which tend to have a lot of colors. We can replace such a
"rainbow" of arcs with a single arc labeled with a special color
RAINBOW. This is worth doing on its own account, just because it saves
space and time. For example, on the reg-33.15.1 test case in
test_regex.sql (a moderately large real-world RE), I find that HEAD
requires 1377614 bytes to represent the compiled RE, and the peak space
usage during pg_regcomp() is 3124376 bytes. With this patch, that drops
to 1077166 bytes for the RE (21% savings) with peak compilation space
2800752 bytes (10% savings). Moreover, the runtime for that test case
drops from ~57ms to ~44ms, a 22% savings. (This is mostly measuring the
RE compilation time. Execution time should drop a bit too since miss()
need consider fewer arcs; but that savings is in a cold code path so it
won't matter much.) These aren't earth-shattering numbers of course,
but for the amount of code needed, it seems well worth while.
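The arc-count saving can be illustrated with a toy sketch (Python, not the engine's actual C data structures; the arc layout and the RAINBOW sentinel value are assumptions for illustration only):

```python
from collections import defaultdict

# Hypothetical colors: the real engine assigns small integers per
# character-equivalence class; RAINBOW here is an assumed sentinel.
RAINBOW = -1

def collapse_rainbow(arcs, ncolors):
    """arcs: list of (color, target) out-arcs of one NFA state.
    If some target is reachable under every color -- i.e. the arcs
    form a full 'rainbow' -- replace that fan of arcs with a single
    RAINBOW arc, keeping all other arcs unchanged."""
    colors_by_target = defaultdict(set)
    for color, target in arcs:
        colors_by_target[target].add(color)
    out, collapsed = [], set()
    for color, target in arcs:
        if colors_by_target[target] == set(range(ncolors)):
            if target not in collapsed:   # emit the RAINBOW arc once
                out.append((RAINBOW, target))
                collapsed.add(target)
        else:
            out.append((color, target))
    return out
```

With 3 colors, a state whose out-arcs are [(0, s), (1, s), (2, s), (0, t)] collapses to [(RAINBOW, s), (0, t)]: each "." costs one arc instead of one arc per color.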
A possible point of contention is that I exposed the idea of a rainbow
arc in the regexport.h APIs, which will force consumers of that API
to adapt --- see the changes to contrib/pg_trgm for an example. I'm
not too concerned about this because I kinda suspect that pg_trgm is
the only consumer of that API anywhere. (codesearch.debian.net knows
of no others, anyway.) We could in principle hide the change by
having the regexport functions expand a rainbow arc into one for
each color, but that seems like make-work. pg_trgm would certainly
not see it as an improvement, and in general users of that API should
appreciate recognizing rainbows as such, since they might be able to
apply optimizations that depend on doing so.
Which brings us to 0002, which is exactly such an optimization.
The idea here is to short-circuit character-by-character scanning
when matching a sub-NFA that is like "." or ".*" or variants of
that, ie it will match any sequence of some number of characters.
This requires the ability to recognize that a particular pair of
NFA states are linked by a rainbow, so it's a lot less painful
to do when rainbows are represented explicitly. The example that
got me interested in this is adapted from a Tcl trouble report:
select array_dims(regexp_matches(repeat('x',40) || '=' || repeat('y',50000),
'^(.*)=(.*)$'));
On my machine, this takes about 6 seconds in HEAD, because there's an
O(N^2) effect: we try to match the sub-NFA for the first "(.*)" capture
group to each possible starting string, and only after expensively
verifying that tautological match do we check to see if the next
character is "=". By not having to do any per-character work to decide
that .* matches a substring, the O(N^2) behavior is removed and the time
drops to about 7 msec.
(One could also imagine fixing this by rearranging things to check for
the "=" match before verifying the capture-group matches. That's an
idea I hope to look into in future, because it could help for cases
where the variable parts are not merely ".*". But I don't have clear
ideas about how to do that, and in any case ".*" is common enough that
the present change should still be helpful.)
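The quadratic behavior can be seen in a toy cost model (Python; a sketch of the matching strategy described above, not the real engine):

```python
def naive_ops(s):
    """Cost model for HEAD's behavior on '^(.*)=(.*)$': greedily try
    the longest candidate for the first '(.*)', verifying each
    candidate prefix character by character before testing for '='."""
    ops = 0
    for i in range(len(s), -1, -1):   # longest candidate first
        ops += i                      # re-verify '.*' against s[:i]
        ops += 1                      # is the next character '='?
        if i < len(s) and s[i] == '=':
            break
    return ops

def shortcircuit_ops(s):
    """Same search, but '.*' is known to be match-all, so accepting
    each candidate split point costs O(1) instead of O(i)."""
    ops = 0
    for i in range(len(s), -1, -1):
        ops += 2                      # accept '.*' for free, test '='
        if i < len(s) and s[i] == '=':
            break
    return ops
```

For repeat('x',40) || '=' || repeat('y',50000), the naive model performs on the order of a billion character inspections while the short-circuit model performs about a hundred thousand, which mirrors the ~6 s vs ~7 ms difference.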
There are two non-boilerplate parts of the 0002 patch. One is the
checkmatchall() function that determines whether an NFA is match-all,
and if so what the min and max match lengths are. This is actually not
very complicated once you understand what the regex engine does at the
"pre" and "post" states. (See the "Detailed semantics" part of
regex/README for some info about that, which I tried to clarify as part
of the patch.) Other than those endpoint conditions it's just a
recursive graph search. The other hard part is the changes in
rege_dfa.c to provide the actual short-circuit behavior at runtime.
That's ticklish because it's trying to emulate some overly complicated
and underly documented code, particularly in longest() and shortest().
I think that stuff is right; I've studied it and tested it. But it
could use more eyeballs.
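In toy form, the graph-search part of the idea looks roughly like this (Python; the arc layout and RAINBOW sentinel are assumptions, this sketch ignores the pre/post endpoint conditions the real checkmatchall() must handle, and it assumes dead-end states have already been removed, as the engine's optimize() pass arranges):

```python
from collections import deque

RAINBOW = "rainbow"   # assumed sentinel for the match-anything color

def check_matchall(nfa, pre, post):
    """nfa: dict state -> list of (color, target) arcs.  Return
    (min_len, max_len) if the sub-NFA matches strings of those lengths
    regardless of content, else None.  max_len can be float('inf')."""
    # Any non-rainbow arc means matching depends on actual characters.
    for arcs in nfa.values():
        for color, _ in arcs:
            if color != RAINBOW:
                return None
    # Minimum match length: BFS shortest path from pre to post.
    dist = {pre: 0}
    queue = deque([pre])
    while queue:
        s = queue.popleft()
        for _, t in nfa.get(s, []):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    if post not in dist:
        return None
    # Maximum match length: DFS longest path; revisiting a state on
    # the current path means arbitrarily long matches are possible
    # (valid given the no-dead-ends assumption above).
    def longest(s, onpath):
        if s == post:
            return 0
        if s in onpath:
            return float("inf")
        best = -1
        for _, t in nfa.get(s, []):
            d = longest(t, onpath | {s})
            if d >= 0:
                best = max(best, 1 + d)
        return best
    return dist[post], longest(pre, frozenset())
```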
Notably, I had to add some more test cases to test_regex.sql to exercise
the short-circuit part of matchuntil() properly. That's only used for
lookbehind constraints, so we won't hit the short-circuit path except
with something like '(?<=..)', which is maybe a tad silly.
I'll add this to the upcoming commitfest.
regards, tom lane
Hi Tom,
On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote:
0001-invent-rainbow-arcs.patch
0002-recognize-matchall-NFAs.patch
Many thanks for working on the regex engine,
this looks like an awesome optimization.
To test the correctness of the patches,
I thought it would be good to have some real-life regexes,
and, just as importantly, some real-life text strings
to which the real-life regexes are applied.
I therefore patched Chromium's V8 regex engine
to log the actual regexes that get compiled when
visiting websites, and also the text strings
the regexes are applied to when they are executed.
I logged the regex and text strings as base64 encoded
strings to STDOUT, to make it easy to grep out the data,
so it could be imported into PostgreSQL for analytics.
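The decoding step might look like this (Python; the tag and field layout are illustrative assumptions, since the actual log format emitted by the patched V8 isn't shown here):

```python
import base64

def decode_log(lines, tag="REGEXP\t"):
    """Yield (pattern, flags, subject) tuples from log lines of the
    assumed form 'REGEXP<TAB>b64(pattern)<TAB>b64(flags)<TAB>b64(subject)',
    skipping everything else, ready for COPY into PostgreSQL."""
    for line in lines:
        if not line.startswith(tag):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        yield tuple(base64.b64decode(f).decode("utf-8", "replace")
                    for f in fields)
```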
In total, I scraped the first page of some ~50k websites,
which produced 45M test rows to import;
after a GROUP BY pattern and flags, these reduced
to 235k distinct regex patterns
and 1.5M distinct text string subjects.
Here are some statistics on the different flags used:
SELECT *, SUM(COUNT) OVER () FROM (SELECT flags, COUNT(*) FROM patterns GROUP BY flags) AS x ORDER BY COUNT DESC;
flags | count | sum
-------+--------+--------
| 150097 | 235204
i | 43537 | 235204
g | 22029 | 235204
gi | 15416 | 235204
gm | 2411 | 235204
gim | 602 | 235204
m | 548 | 235204
im | 230 | 235204
y | 193 | 235204
gy | 60 | 235204
giy | 29 | 235204
giu | 26 | 235204
u | 11 | 235204
iy | 6 | 235204
gu | 5 | 235204
gimu | 2 | 235204
iu | 1 | 235204
my | 1 | 235204
(18 rows)
As we can see, no flag at all is the most common, followed by the "i" flag.
Most of the JavaScript regexes (97%) could be understood by PostgreSQL;
only 3% produced some kind of error, which is not unexpected,
since some JavaScript regex features like \w and \W have different
syntax in PostgreSQL:
SELECT *, SUM(COUNT) OVER () FROM (SELECT is_match,error,COUNT(*) FROM subjects GROUP BY is_match,error) AS x ORDER BY count DESC;
is_match | error | count | sum
----------+---------------------------------------------------------------+--------+---------
f | | 973987 | 1489489
t | | 474225 | 1489489
| invalid regular expression: invalid escape \ sequence | 39141 | 1489489
| invalid regular expression: invalid character range | 898 | 1489489
| invalid regular expression: invalid backreference number | 816 | 1489489
| invalid regular expression: brackets [] not balanced | 327 | 1489489
| invalid regular expression: invalid repetition count(s) | 76 | 1489489
| invalid regular expression: quantifier operand invalid | 17 | 1489489
| invalid regular expression: parentheses () not balanced | 1 | 1489489
| invalid regular expression: regular expression is too complex | 1 | 1489489
(10 rows)
Having had some fun looking at statistics, let's move on to see whether there are
any observable differences between HEAD (8063d0f6f56e53edd991f53aadc8cb7f8d3fdd8f)
and the same build with these two patches applied.
To detect any differences,
for each (regex pattern, text string subject) pair,
the columns

    is_match boolean
    captured text[]
    error text

were set by a PL/pgSQL function running on HEAD:
BEGIN
_is_match := _subject ~ _pattern;
_captured := regexp_match(_subject, _pattern);
EXCEPTION WHEN OTHERS THEN
UPDATE subjects SET
error = SQLERRM
WHERE subject_id = _subject_id;
CONTINUE;
END;
UPDATE subjects SET
is_match = _is_match,
captured = _captured
WHERE subject_id = _subject_id;
The patches
0001-invent-rainbow-arcs.patch
0002-recognize-matchall-NFAs.patch
were then applied and this query was executed to spot any differences:
SELECT
is_match <> (subject ~ pattern) AS is_match_diff,
captured IS DISTINCT FROM regexp_match(subject, pattern) AS captured_diff,
COUNT(*)
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern) OR captured IS DISTINCT FROM regexp_match(subject, pattern))
GROUP BY 1,2
ORDER BY 3 DESC
;
The query was first run on unpatched HEAD to verify that it finds no differences.
Indeed it returned 0 rows, and it took this long to finish:
Time: 186077.866 ms (03:06.078)
Running the same query with the two patches applied was significantly faster:
Time: 111785.735 ms (01:51.786)
No is_match differences were detected, good!
However, there were 23 cases where what got captured differed:
-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject | v-cloak
is_match_head | t
captured_head | {cloak,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-cloak}
-[ RECORD 2 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject | v-if
is_match_head | t
captured_head | {if,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-if}
-[ RECORD 3 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?a5oc.com).*
subject | https://a5oc.com/attachments/6b184e79-6a7f-43e0-ac59-7ed9d0a8eb7e-jpeg.179582/
is_match_head | t
captured_head | {https://,a5oc.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 4 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?allfordmustangs.com).*
subject | https://allfordmustangs.com/attachments/e463e329-0397-4e13-ad41-f30c6bc0659e-jpeg.779299/
is_match_head | t
captured_head | {https://,allfordmustangs.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 5 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?audi-forums.com).*
subject | https://audi-forums.com/attachments/screenshot_20210207-151100_ebay-jpg.11506/
is_match_head | t
captured_head | {https://,audi-forums.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 6 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?can-amforum.com).*
subject | https://can-amforum.com/attachments/resized_20201214_163325-jpeg.101395/
is_match_head | t
captured_head | {https://,can-amforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 7 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?contractortalk.com).*
subject | https://contractortalk.com/attachments/maryann-porch-roof-quote-12feb2021-jpg.508976/
is_match_head | t
captured_head | {https://,contractortalk.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 8 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?halloweenforum.com).*
subject | https://halloweenforum.com/attachments/dead-fred-head-before-and-after-jpg.744080/
is_match_head | t
captured_head | {https://,halloweenforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 9 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?horseforum.com).*
subject | https://horseforum.com/attachments/dd90f089-9ae9-4521-98cd-27bda9ad38e9-jpeg.1109329/
is_match_head | t
captured_head | {https://,horseforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 10 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?passatworld.com).*
subject | https://passatworld.com/attachments/clean-passat-jpg.102337/
is_match_head | t
captured_head | {https://,passatworld.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 11 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?plantedtank.net).*
subject | https://plantedtank.net/attachments/brendon-60p-jpg.1026075/
is_match_head | t
captured_head | {https://,plantedtank.net,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 12 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?vauxhallownersnetwork.co.uk).*
subject | https://vauxhallownersnetwork.co.uk/attachments/opelnavi-jpg.96639/
is_match_head | t
captured_head | {https://,vauxhallownersnetwork.co.uk,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 13 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?volvov40club.com).*
subject | https://volvov40club.com/attachments/img_20210204_164157-jpg.17356/
is_match_head | t
captured_head | {https://,volvov40club.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 14 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?vwidtalk.com).*
subject | https://vwidtalk.com/attachments/1613139846689-png.1469/
is_match_head | t
captured_head | {https://,vwidtalk.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 15 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?yellowbullet.com).*
subject | https://yellowbullet.com/attachments/20210211_133934-jpg.204604/
is_match_head | t
captured_head | {https://,yellowbullet.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 16 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[^\?]*)?(\?[^#]*)?(#.*$)?
subject | https://www.disneyonice.com/oneIdResponder.html
is_match_head | t
captured_head | {https://www.disneyonice.com/oneIdResponder.html,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 17 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject | /
is_match_head | t
captured_head | {/,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 18 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject | /en.html
is_match_head | t
captured_head | {/en,.html}
is_match_patch | t
captured_patch |
-[ RECORD 19 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^https?:\/\/)?(((\[[^\]]+\])|([^:\/\?#]+))(:(\d+))?)?([^\?#]*)(.*)?
subject | https://e.echatsoft.com/mychat/visitor
is_match_head | t
captured_head | {https://,e.echatsoft.com,e.echatsoft.com,NULL,e.echatsoft.com,NULL,NULL,/mychat/visitor,""}
is_match_patch | t
captured_patch | {NULL,https,https,NULL,https,NULL,NULL,://e.echatsoft.com/mychat/visitor,""}
-[ RECORD 20 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^|.)41nbc.com$|(^|.)41nbc.dev$|(^|.)52.23.179.12$|(^|.)52.3.245.221$|(^|.)clipsyndicate.com$|(^|.)michaelbgiordano.com$|(^|.)syndicaster.tv$|(^|.)wdef.com$|(^|.)wdef.dev$|(^|.)wxxv.mysiteserver.net$|(^|.)wxxv25.dev$|(^|.)clipsyndicate.com$|(^|.)syndicaster.tv$
subject | wdef.com
is_match_head | t
captured_head | {NULL,NULL,NULL,NULL,NULL,NULL,NULL,"",NULL,NULL,NULL,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 21 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^((^\w+:|^)\/\/)?(?:www\.)?
subject | https://www.deputy.com/
is_match_head | t
captured_head | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 22 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^((^\w+:|^)\/\/)?(?:www\.)?
subject | https://www.westernsydney.edu.au/
is_match_head | t
captured_head | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 23 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^(https?:){0,1}\/\/|
subject | https://ui.powerreviews.com/api/
is_match_head | t
captured_head | {https:}
is_match_patch | t
captured_patch | {NULL}
The code to reproduce the results has been pushed here:
https://github.com/truthly/regexes-in-the-wild
Let me know if you want access to the dataset;
I could open up a port to my PostgreSQL server so you could take a dump.
SELECT
pg_size_pretty(pg_relation_size('patterns')) AS patterns,
pg_size_pretty(pg_relation_size('subjects')) AS subjects;
patterns | subjects
----------+----------
20 MB | 568 MB
(1 row)
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
In total, I scraped the first page of some ~50k websites,
which produced 45M test rows to import;
after a GROUP BY pattern and flags, these reduced
to 235k distinct regex patterns
and 1.5M distinct text string subjects.
This seems like an incredibly useful test dataset.
I'd definitely like a copy.
No is_match differences were detected, good!
Cool ...
However, there were 23 cases where what got captured differed:
I shall take a closer look at that.
Many thanks for doing this work!
regards, tom lane
"Joel Jacobson" <joel@compiler.org> writes:
No is_match differences were detected, good!
However, there were 23 cases where what got captured differed:
These all stem from the same oversight: checkmatchall() was being
too cavalier by ignoring "pseudocolor" arcs, which are arcs that
match start-of-string or end-of-string markers. I'd supposed that
pseudocolor arcs necessarily match parallel RAINBOW arcs, because
they start out that way (cf. newnfa). But it turns out that
some edge-of-string constraints can be optimized in such a way that
they only appear in the final NFA in the guise of missing or extra
pseudocolor arcs. We have to actually check that the pseudocolor arcs
match the RAINBOW arcs, otherwise our "matchall" NFA isn't one because
it acts differently at the start or end of the string than it does
elsewhere.
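Per state, the added check can be sketched like this (Python; assumed arc layout, purely illustrative of the rule just described):

```python
RAINBOW = "rainbow"   # assumed sentinel for the match-anything color

def pseudo_matches_rainbow(arcs, pseudo_colors):
    """arcs: (color, target) out-arcs of one state.  For the NFA to be
    truly match-all, each pseudocolor (the colors used for the start-
    and end-of-string markers) must lead to exactly the same target
    states as the RAINBOW arcs do -- otherwise the NFA behaves
    differently at the string boundaries than in the middle."""
    rainbow_targets = {t for c, t in arcs if c == RAINBOW}
    return all({t for c, t in arcs if c == pc} == rainbow_targets
               for pc in pseudo_colors)
```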
So here's a revised pair of patches (0001 is actually the same as
before).
Thanks again for testing!
regards, tom lane
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
0001-invent-rainbow-arcs-2.patch
0002-recognize-matchall-NFAs-2.patch
I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.
Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:
This 100577-byte regex (pattern_id = 207811)...
\.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mil\.ac| ... |wmflabs\.org|yolasite\.com|za\.net|za\.org)$
...previously raised...
error invalid regular expression: regular expression is too complex
...but now goes through:
is_match <NULL> => t
captured <NULL> => {de}
error invalid regular expression: regular expression is too complex => <NULL>
Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.
The test that found the deviation tests each (pattern, text string) individually,
to catch errors. But I've also made a separate query to just test regexes
known to not cause errors, to allow testing all regexes in one big query,
which fully utilizes the CPU cores and also runs quicker.
Below is the result of the performance test query:
\timing
SELECT
tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, patterns.pattern),
COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
?column? | ?column? | count
----------+----------+---------
t | t | 1448212
(1 row)
Time: 592196.145 ms (09:52.196)
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
?column? | ?column? | count
----------+----------+---------
t | t | 1448212
(1 row)
Time: 461739.364 ms (07:41.739)
That's an impressive 22% speed-up!
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.
Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:
This 100577-byte regex (pattern_id = 207811)...
...
...previously raised...
error invalid regular expression: regular expression is too complex
...but now goes through:
Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.
Yeah. There are various limitations that can lead to REG_ETOOBIG, but the
main ones are "too many states" and "too many arcs". The RAINBOW change
directly reduces the number of arcs and thus makes larger regexes feasible.
I'm sure it's coincidental that the one such example you captured happens
to be fixed by this change, but hey I'll take it.
regards, tom lane
"Joel Jacobson" <joel@compiler.org> writes:
Below is the result of the performance test query:
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
Time: 592196.145 ms (09:52.196)
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
Time: 461739.364 ms (07:41.739)
That's an impressive 22% speed-up!
I've been doing some more hacking over the weekend, and have a couple
of additional improvements to show. The point of these two additional
patches is to reduce the number of "struct subre" sub-regexps that
the regex parser creates. The subre's themselves aren't that large,
so this might seem like it would have only small benefit. However,
each subre requires its own NFA for the portion of the RE that it
matches. That adds space, and it also adds compilation time because
we run the "optimize()" pass separately for each such NFA. Maybe
there'd be a way to share some of that work, but I'm not very clear
how. In any case, not having a subre at all is clearly better where
we can manage it.
0003 is a small patch that fixes up parseqatom() so that it doesn't
emit no-op subre's for empty portions of a regexp branch that are
adjacent to a "messy" regexp atom (that is, a capture node, a
backref, or an atom with greediness different from what preceded it).
0004 is a rather larger patch whose result is to get rid of extra
subre's associated with alternation subre's. If we have a|b|c
and any of those alternation branches are messy, we end up with
*
/ \
a *
/ \
b *
/ \
c NULL
where each "*" is an alternation subre node, and all those "*"'s have
identical NFAs that match the whole a|b|c construct. This means that
for an N-way alternation we're going to need something like O(N^2)
work to optimize all those NFAs. That's embarrassing (and I think
it's my fault --- if memory serves, I put in this representation
of messy alternations years ago).
We can improve matters by having just one parent node for an
alternation:
*
\
a -> b -> c
That requires replacing the binary-tree structure of subre's
with a child-and-sibling arrangement, which is not terribly
difficult but accounts for most of the bulk of the patch.
(I'd wanted to do that for years, but up till now I did not
think it would have any real material benefit.)
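The shape change can be sketched with toy node structures (Python; not the actual C structs): since every "*" node carries its own NFA covering the whole alternation, the count of such nodes directly tracks the number of whole-alternation optimize() passes.

```python
class Subre:
    """Toy stand-in for the parser's 'struct subre' (illustrative only)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def binary_alt(branches):
    """a|b|c as HEAD builds it: *(a, *(b, *(c, NULL))) -- one
    alternation node, and hence one whole-alternation NFA to
    optimize, per branch."""
    node = None
    for b in reversed(branches):
        node = Subre("*", [Subre(b)] + ([node] if node else []))
    return node

def nary_alt(branches):
    """The patched child-and-sibling form: one parent for all branches."""
    return Subre("*", [Subre(b) for b in branches])

def count_alt_nodes(node):
    if node is None:
        return 0
    return (node.label == "*") + sum(count_alt_nodes(c)
                                     for c in node.children)
```

For an N-way alternation the binary form yields N alternation nodes (hence ~O(N^2) total optimization work over NFAs of size ~N), while the n-ary form yields exactly one.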
There might be more that can be done in this line, but that's
as far as I got so far.
I did some testing on this using your dataset (thanks for
giving me a copy) and this query:
SELECT
pattern,
subject,
is_match AS is_match_head,
captured AS captured_head,
subject ~ pattern AS is_match_patch,
regexp_match(subject, pattern) AS captured_patch
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern)
OR captured IS DISTINCT FROM regexp_match(subject, pattern));
I got these runtimes (non-cassert builds):
HEAD 313661.149 ms (05:13.661)
+0001 297397.293 ms (04:57.397) 5% better than HEAD
+0002 151995.803 ms (02:31.996) 51% better than HEAD
+0003 139843.934 ms (02:19.844) 55% better than HEAD
+0004 95034.611 ms (01:35.035) 69% better than HEAD
Since I don't have all the tables used in your query, I can't
try to reproduce your results exactly. I suspect the reason
I'm getting a better percentage improvement than you did is
that the joining/grouping/ordering involved in your query
creates a higher baseline query cost.
Anyway, I'm feeling pretty pleased with these results ...
regards, tom lane
Attachments:
0001-invent-rainbow-arcs-3.patch
0002-recognize-matchall-NFAs-3.patch
0003-remove-useless-concat-nodes.patch
0004-make-subre-trees-Nary.patch
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
I got these runtimes (non-cassert builds):
HEAD  313661.149 ms (05:13.661)
+0001 297397.293 ms (04:57.397)  5% better than HEAD
+0002 151995.803 ms (02:31.996)  51% better than HEAD
+0003 139843.934 ms (02:19.844)  55% better than HEAD
+0004 95034.611 ms (01:35.035)  69% better than HEAD
Since I don't have all the tables used in your query, I can't
try to reproduce your results exactly. I suspect the reason
I'm getting a better percentage improvement than you did is
that the joining/grouping/ordering involved in your query
creates a higher baseline query cost.
Mind blowing speed-up, wow!
I've tested all 4 patches successfully.
To eliminate the baseline cost of the join,
I first created this table:
CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
;
Then I ran this query:
\timing
SELECT
is_match <> (subject ~ pattern),
captured IS DISTINCT FROM regexp_match(subject, pattern),
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;
All patches gave the same result:
?column? | ?column? | count
----------+----------+---------
f | f | 1448212
(1 row)
I.e., no detected semantic differences.
Timing differences:
HEAD 570632.722 ms (09:30.633)
+0001 472938.857 ms (07:52.939) 17% better than HEAD
+0002 451638.049 ms (07:31.638) 20% better than HEAD
+0003 439377.813 ms (07:19.378) 23% better than HEAD
+0004 96447.038 ms (01:36.447) 83% better than HEAD
I tested on my MacBook Pro 2.4GHz 8-Core Intel Core i9, 32 GB 2400 MHz DDR4 running macOS Big Sur 11.1:
SELECT version();
version
----------------------------------------------------------------------------------------------------------------------
PostgreSQL 14devel on x86_64-apple-darwin20.2.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
(1 row)
My HEAD = 46d6e5f567906389c31c4fb3a2653da1885c18ee.
PostgreSQL was compiled with just ./configure, no parameters, and the only non-default postgresql.conf settings were these:
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql.log'
Amazing work!
I hope to have a new dataset ready soon with regex flags for applied subjects as well.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
I've tested all 4 patches successfully.
Thanks!
I found one other area that could be improved with the same idea of
getting rid of unnecessary subre's: right now, every pair of capturing
parentheses gives rise to a "capture" subre with an NFA identical to
its single child subre (which is what does the actual matching work).
While this doesn't really add any runtime cost, the duplicate NFA
definitely does add to the compilation cost, since we run it through
optimization independently of the child.
I initially thought that we could just flush capture subres altogether
in favor of annotating their children with a "we need to capture this
result" marker. However, Spencer's regression tests soon exposed the
flaw in that notion. It's legal to write "((x))" or even "((((x))))",
so that there can be multiple sets of capturing parentheses with a
single child node. The solution adopted in the attached 0005 is to
handle the innermost capture with a marker on the child subre, but if
we need an additional capture on a node that's already marked, put
a capture subre on top just like before. One could instead complicate
the data structure by allowing N capture markers on a single subre
node, but I judged that not to be a good tradeoff. I don't see any
reason except spec compliance to allow such equivalent capture groups,
so I don't care if they're a bit inefficient. (If anyone knows of a
useful application for writing REs like this, we could reconsider that
choice.)
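The rule is easy to visualize with a toy model (illustrative Python only; the names below are invented for this sketch and are not the engine's actual subre API):

```python
# Toy model of the 0005 scheme: the first capture applied to a node becomes a
# direct marker on it; a second capture on an already-marked node falls back
# to a wrapper node, as the old code always produced.

class Node:
    def __init__(self, kind, child=None, capno=0):
        self.kind = kind      # 'atom' does the matching work, 'capture' is a wrapper
        self.child = child
        self.capno = capno    # 0 means "not capturing"

def add_capture(node, capno):
    """Make node's match be captured as group `capno`."""
    if node.capno == 0:
        node.capno = capno            # cheap case: mark the node directly
        return node
    # already marked for another group: wrap it, just like before
    return Node('capture', child=node, capno=capno)

# "((x))": applying captures inner-to-outer, the inner group is a free
# marker, and only the redundant outer group costs an extra node.
x = add_capture(Node('atom'), 2)   # inner parens (group 2)
top = add_capture(x, 1)            # outer parens (group 1) needs a wrapper
```

So a single set of capturing parens costs no extra node at all, and only the spec-compliance cases like "((x))" pay for a wrapper.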
One small issue with marking the child directly is that we can't get
away any longer with overlaying capture and backref subexpression
numbers, since you could theoretically write (\1) which'd result in
needing to put a capture label on a backref subre. This could again
have been handled by making the capture a separate node, but I really
don't much care for the way that subre.subno has been overloaded for
three(!) different purposes depending on node type. So I just broke
it into three separate fields. Maybe the incremental cost of the
larger subre struct was worth worrying about back in 1997 ... but
I kind of doubt that it was a useful micro-optimization even then,
considering the additional NFA baggage that every subre carries.
Also, I widened "subre.id" from short to int, since the narrower field
no longer saves anything given the new struct layout. The existing
choice was dubious already, because every other use of subre ID
numbers was int or even size_t, and there was nothing checking for
overflow of the id fields. (Although perhaps it doesn't matter,
since I'm unsure that the id fields are really used for anything
except debugging purposes.)
For me, 0005 makes a fairly perceptible difference on your test case
subject_id = 611875, which I've been paying attention to because it's
the one that failed with "regular expression is too complex" before.
I see about a 20% time savings from 0004 on that case, but not really
any noticeable difference in the total runtime for the whole suite.
So I think we're getting to the point of diminishing returns for
this concept (another reason for not chasing after optimization of
the duplicate-captures case). Still, we're clearly way ahead of
where we started.
Attached is an updated patch series; it's rebased over 4e703d671
which took care of some not-really-related fixes, and I made a
pass of cleanup and comment improvements. I think this is pretty
much ready to commit, unless you want to do more testing or
code-reading.
regards, tom lane
Attachments:
0001-invent-rainbow-arcs-4.patch
0002-recognize-matchall-NFAs-4.patch
0003-remove-useless-concat-nodes-4.patch
0004-make-subre-trees-Nary-4.patch
0005-remove-separate-capture-nodes-4.patch
On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote:
Attached is an updated patch series; it's rebased over 4e703d671
which took care of some not-really-related fixes, and I made a
pass of cleanup and comment improvements. I think this is pretty
much ready to commit, unless you want to do more testing or
code-reading.
I've produced a new dataset which now also includes the regex flags (if any) used for each subject applied to a pattern.
The new dataset contains 318364 patterns and 4474520 subjects.
(The old one had 235204 patterns and 1489489 subjects.)
I've tested the new dataset against PostgreSQL 10.16, 11.11, 12.6, 13.2, HEAD (4e703d671) and HEAD+patches.
I based the comparisons on the subjects that didn't cause an error on 13.2:
CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
patterns.flags,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
;
I then measured the query below for each PostgreSQL version:
\timing
SELECT version();
SELECT
is_match <> (subject ~ pattern) AS is_match_diff,
captured IS DISTINCT FROM regexp_match(subject, pattern, flags) AS captured_diff,
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;
All versions produce the same result:
is_match_diff | captured_diff | count
---------------+---------------+---------
f | f | 3254769
(1 row)
Good! Not a single differing case out of over 3 million regex pattern/subject combinations,
across five major PostgreSQL versions! That's a very stable regex engine.
To get a feeling for the standard deviation of the timings,
I executed the same query above three times for each PostgreSQL version:
PostgreSQL 10.16 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
Time: 795674.830 ms (13:15.675)
Time: 794249.704 ms (13:14.250)
Time: 771036.707 ms (12:51.037)
PostgreSQL 11.11 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 765466.191 ms (12:45.466)
Time: 787135.316 ms (13:07.135)
Time: 779582.635 ms (12:59.583)
PostgreSQL 12.6 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 785500.516 ms (13:05.501)
Time: 784511.591 ms (13:04.512)
Time: 786727.973 ms (13:06.728)
PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)
PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)
HEAD (4e703d671)+0001+0002+0003+0004+0005
Time: 141290.329 ms (02:21.290)
Time: 141849.709 ms (02:21.850)
Time: 141630.819 ms (02:21.631)
That's a mind-blowing speed-up!
I also ran the more detailed test between 13.2 and HEAD+patches,
which also checks for differences in errors.
Like before, one similar improvement was found: a case
which previously resulted in an error now goes through OK:
SELECT * FROM vdeviations;
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$
flags |
subject | www.aeroexpo.online
count | 1
a_server_version | 13.2
a_duration | 00:00:00.298253
a_is_match |
a_captured |
a_error | invalid regular expression: regular expression is too complex
b_server_version | 14devel
b_duration | 00:00:00.665958
b_is_match | t
b_captured | {online}
b_error |
Very nice.
I've uploaded the new dataset to the same place as before.
The schema for it can be found at https://github.com/truthly/regexes-in-the-wild
If anyone else would like a copy of the 715MB dataset, please let me know.
/Joel
On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote:
SELECT * FROM vdeviations;
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$
Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.
I noticed it when trying to read from the vdeviations view in PostgreSQL 13.2.
Here is my little helper-function which I used to shrink patterns/subjects longer than N characters:
CREATE OR REPLACE FUNCTION shrink_text(text,integer) RETURNS text LANGUAGE sql AS $$
SELECT CASE WHEN length($1) < $2 THEN $1 ELSE
format('%s ... %s chars ... %s', m[1], length(m[2]), m[3])
END
FROM (
SELECT regexp_matches($1,format('^(.{1,%1$s})(.*?)(.{1,%1$s})$',$2/2)) AS m
) AS q
$$;
The regex aims to produce three capture groups,
where I wanted the first and third ones to be greedy
and match up to $2 characters (controlled by the second input param to the function),
and the second capture group in the middle to be non-greedy,
but match the remainder to make up a fully anchored match.
It works as expected in both 13.2 and HEAD+patches, but the speed-up is enormous:
PostgreSQL 13.2:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
QUERY PLAN
-------------------------------------------------------------------------------------------------
ProjectSet (cost=0.00..0.02 rows=1 width=32) (actual time=23600.816..23600.838 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
Planning Time: 0.432 ms
Execution Time: 23600.859 ms
(4 rows)
HEAD+0001+0002+0003+0004+0005:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
QUERY PLAN
-------------------------------------------------------------------------------------------
ProjectSet (cost=0.00..0.02 rows=1 width=32) (actual time=36.656..36.661 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.002 rows=1 loops=1)
Planning Time: 0.575 ms
Execution Time: 36.689 ms
(4 rows)
Cool stuff.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
I've produced a new dataset which now also includes the regex flags (if
any) used for each subject applied to a pattern.
Again, thanks for collecting this data! I'm a little confused about
how you produced the results in the "tests" table, though. It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well. Even discounting
that, I'm not getting quite the same results, and I don't understand
why not. So how was that made from the raw "patterns" and "subjects"
tables?
PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)
PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)
Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD. I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan). Still, the next drop to circa 2:21 runtime is impressive enough
by itself.
Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.
Yeah, that just happens to be a poster child for the MATCHALL idea:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one. That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.
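As a rough sketch of why each such check is cheap (illustrative Python, not the engine's C; the function names are mine, and I assume the greedy/non-greedy preferences resolve as just described): testing whether a MATCHALL sub-RE accepts a candidate substring degenerates to a length comparison, so the division points for Joel's pattern can be found arithmetically.

```python
# Once a sub-RE is known to match *any* string whose length lies in [lo, hi],
# "does this substring match?" is an O(1) length test -- no character scan.

def matchall_test(lo, hi, length):
    """hi is None for an unbounded range like 0..infinity."""
    return lo <= length and (hi is None or length <= hi)

def divide(n):
    """Division points for ^(.{1,80})(.*?)(.{1,80})$ over a string of
    length n (n >= 2): the greedy head takes its maximum, and the
    non-greedy middle yields as much as possible to the tail."""
    head = min(80, n - 1)            # must leave at least 1 char for the tail
    tail = min(80, n - head)
    middle = n - head - tail
    assert matchall_test(1, 80, head)
    assert matchall_test(0, None, middle)
    assert matchall_test(1, 80, tail)
    return head, middle, tail

# For repeat('a',100000) this yields lengths (80, 99840, 80) without ever
# scanning the 100000 characters.
```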
regards, tom lane
I thought it was worth looking a little more closely at the error
cases in this set of tests, as a form of competitive analysis versus
Javascript's regex engine. I ran through the cases that gave errors,
and pinned down exactly what was causing the error for as many cases
as I could. (These results are from your first test corpus, but
I doubt the second one would give different conclusions.)
We have these errors reported in the test corpus:
error | count
-----------------------------------+-------
invalid escape \ sequence | 39141
invalid character range | 898
invalid backreference number | 816
brackets [] not balanced | 327
invalid repetition count(s) | 76
quantifier operand invalid | 17
parentheses () not balanced | 1
regular expression is too complex | 1
The existing patchset takes care of the one "regular expression is too
complex" failure. Of the rest:
It turns out that almost 39000 of the "invalid escape \ sequence"
errors are due to use of \D, \S, or \W within a character class.
We support the positive-class shorthands \d, \s, \w there, but not
their negations. I think that this might be something that Henry
Spencer just never got around to; I don't see any fundamental reason
we can't allow it, although some refactoring might be needed in the
regex lexer. Given the apparent popularity of this notation, maybe
we should put some work into that.
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
About half of the remaining escape-sequence complaints seem to be due
to just randomly backslashing alphanumeric characters that don't need
it, as for example "i" in "\itunes\.apple\.com". Apparently
Javascript is content to take "\i" as just meaning "i". Our engine
rejects that, with a view to keeping such combinations reserved for
future definition. That's fine by me so I don't want to change it.
Of the rest, many are abbreviated numeric escapes, eg "\u45" where our
engine wants to see "\u0045". I don't think being laxer about that
would be a great idea either.
Lastly, there are some occurrences like "[\1]", which in context look
like the \1 might be intended as a back-reference? But I don't really
understand what that's supposed to do inside a bracket expression.
The "invalid character range" errors seem to be coming from constructs
like "[A-Za-z0-9-/]", which our engine rejects because it looks like
a messed-up character range.
All but 123 of the "invalid backreference number" complaints stem from
using backrefs inside lookahead constraints. Some of the rest look
like they think you can put capturing parens inside a lookahead
constraint and then backref that. I'm not really convinced that such
constructs have a well-defined meaning. (I looked at the ECMAscript
definition of regexes, and they do say it's allowed, but when trying
to define it they resort to handwaving about backtracking; at best that
is a particularly lame version of specification by implementation.)
Spencer chose to forbid these cases in our engine, and I think there
are very good implementation reasons why it won't work. Perhaps we
could provide a clearer error message about it, though.
307 of the "brackets [] not balanced" errors, as well as the one
"parentheses () not balanced" error, seem to trace to the fact that
Javascript considers "[]" to be a legal empty character class, whereas
POSIX doesn't allow empty character classes so our engine takes the
"]" literally, and then looks for a right bracket it won't find.
(That is, in POSIX "[]x]" is a character class matching ']' and 'x'.)
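Python's re module happens to follow the same "']' first is literal" convention, so it can illustrate the POSIX-style parse (Javascript, by contrast, reads "[]" as an empty class):

```python
# "[]x]" under the POSIX-style reading: the leading ']' is a literal class
# member, so the class matches ']' and 'x' and nothing else.
import re

assert re.fullmatch(r'[]x]', ']')          # ']' is in the class
assert re.fullmatch(r'[]x]', 'x')          # so is 'x'
assert re.fullmatch(r'[]x]', 'a') is None  # anything else is not
```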
Maybe I'm misinterpreting this too, because if I read the
documentation correctly, "[]" in Javascript matches nothing, making
it impossible for the regex to succeed. Why would such a construct
appear this often?
The remainder of the bracket errors happen because in POSIX, the
sequences "[:", "[=", and "[." within a bracket expression introduce
special syntax, whereas in Javascript '[' is just an ordinary data
character within a bracket expression. Not much we can do here; the
standards are just incompatible.
All but 3 of the "invalid repetition count(s)" errors come from
quantifiers larger than our implementation limit of 255. A lot of
those are exactly 256, though I saw one as high as 3000. The
remaining 3 errors are from syntax like "[0-9]{0-3}", which is a
syntax error according to our engine ("[0-9]{0,3}" is correct).
AFAICT it's not a valid quantifier according to Javascript either;
perhaps that engine is just taking the "{0-3}" as literal text?
Given this, it seems like there's a fairly strong case for increasing
our repetition-count implementation limit, at least to 256, and maybe
1000 or so. I'm hesitant to make the limit *really* large, but if
we can handle a regex containing thousands of "x"'s, it's not clear
why you shouldn't be able to write that as "x{0,1000}".
All of the "quantifier operand invalid" errors come from these
three patterns:
((?!\\)?\{0(?!\\)?\})
((?!\\)?\{1(?!\\)?\})
class="(?!(tco-hidden|tco-display|tco-ellipsis))+.*?"|data-query-source=".*?"|dir=".*?"|rel=".*?"
which are evidently trying to apply a quantifier to a lookahead
constraint, which is just silly.
In short, a lot of this is from incompatible standards, or maybe
from varying ideas about whether to throw an error for invalid
constructs. But I see a couple things we could improve.
regards, tom lane
On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
I've produced a new dataset which now also includes the regex flags (if
any) used for each subject applied to a pattern.
Again, thanks for collecting this data! I'm a little confused about
how you produced the results in the "tests" table, though. It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well.
That's exactly what I did. Some of the flags work the same between Javascript and PostgreSQL, others don't.
I thought maybe something interesting would surface in just trying them blindly.
Flags that aren't supported and give errors are reported as tests where error is not null.
Most patterns have no flags, and second most popular is just the "i" flag, which should work the same.
SELECT flags, COUNT(*) FROM patterns GROUP BY 1 ORDER BY 2 DESC;
flags | count
-------+--------
| 151927
i | 120336
gi | 26057
g | 13263
gm | 4606
gim | 699
im | 491
y | 367
m | 365
gy | 105
u | 50
giy | 38
giu | 20
gimu | 14
iy | 11
iu | 6
gimy | 3
gu | 2
gmy | 2
imy | 1
my | 1
(21 rows)
This query shows which Javascript regex flags could be used as-is without errors:
SELECT
patterns.flags,
COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
GROUP BY 1
ORDER BY 2;
flags | count
-------+---------
im | 2534
m | 4460
i | 543598
| 2704177
(4 rows)
I considered filtering/converting the flags to PostgreSQL,
maybe that would be an interesting approach to try as well.
Even discounting
that, I'm not getting quite the same results, and I don't understand
why not. So how was that made from the raw "patterns" and "subjects"
tables?
The rows in the tests table were generated by the create_regexp_tests() function [1].
Each subject now has a foreign key to a specific pattern,
where the (pattern, flags) combination are unique in patterns.
The actual unique constraint is on (pattern_hash, flags) to avoid
an index directly on pattern which can be huge as we've seen.
So, for each subject, it is known via the pattern_id
exactly what flags were used when the regex was compiled
(and later executed/applied with the subject).
[1]: https://github.com/truthly/regexes-in-the-wild/blob/master/create_regexp_tests.sql
PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)
PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)
Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD. I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan). Still, the next drop to circa 2:21 runtime is impressive enough
by itself.
OK. Another factor might be that the PostgreSQL 10, 11, 12, and 13 versions were compiled elsewhere:
I used the OS X binaries from https://postgresapp.com/, whereas version 14 I of course compiled myself.
Maybe I should have compiled 10, 11, 12, and 13 myself for a better comparison,
but I mostly wanted to verify whether I could find any differences; the performance comparison was a bonus.
Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.
Yeah, that just happens to be a poster child for the MATCHALL idea:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one. That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.
Very nice.
Like you said earlier, perhaps the regex engine has been optimized enough for this time.
If not, you want to investigate an additional idea,
that I think can be seen as a generalization of the optimization trick for (.*),
if I've understood how it works correctly.
Let's see if I can explain the idea:
One of the problems with representing regexes containing large bracket range expressions, like [a-z],
is that you get an explosion of edges if the model can only represent state transitions for single characters.
If we could instead let a single edge (for a state transition) represent a set of characters,
or, usually even more efficiently, a set of character ranges, then we could reduce the
number of edges needed to represent the graph.
The naive approach of just using the ranges as-is doesn't work.
Instead, the graph must first be created with single-character edges.
We then work out which ranges can be constructed such that no range overlaps
any other, so that every range can be treated as a character in an alphabet.
Perhaps fiddling with a few examples is the easiest way
to get a grip on the idea.
Here is a live demo of the idea:
https://compiler.org/reason-re-nfa/src/index.html
The graphs are rendered live when typing in the regex,
using a Javascript port of GraphViz.
For example, try entering the regex: t[a-z]*m
This generates this range-optimized graph for the regex:
/--[a-ln-su-z]-----------------\
|/------t--------------------\ |
|| | |
-->(0)--t-->({0,1})----m-------->({0 1 2}) | |
^---[a-ln-su-z]--/ | |
^-------t-------/ | |
^---------------------------/ |
^-----------------------------/
Notice how the [a-z] bracket expression has been split up,
and we now have 3 distinct ranges:
t
m
[a-ln-su-z]
Since no ranges are overlapping, each such range can safely be seen as a letter in an alphabet.
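The partitioning step can be sketched in a few lines of Python (a toy illustration, not the demo's actual code): cut every input range at every range boundary, so the resulting pieces never partially overlap.

```python
# Given the character ranges used by a regex, split them at every boundary so
# that each resulting piece either fully covers or misses every input range.
# Each piece can then act as one "letter" of the working alphabet.

def partition(ranges):
    """ranges: list of (lo, hi) inclusive character ranges.
    Returns disjoint sub-ranges covering the same characters."""
    # collect boundary points: each range starts at lo and ends after hi
    points = set()
    for lo, hi in ranges:
        points.add(ord(lo))
        points.add(ord(hi) + 1)
    cuts = sorted(points)
    out = []
    for a, b in zip(cuts, cuts[1:]):
        # keep a piece only if some input range covers it
        if any(ord(lo) <= a and b - 1 <= ord(hi) for lo, hi in ranges):
            out.append((chr(a), chr(b - 1)))
    return out

# t[a-z]*m: the literals t and m split [a-z] into the pieces shown above
print(partition([('a', 'z'), ('t', 't'), ('m', 'm')]))
# -> [('a', 'l'), ('m', 'm'), ('n', 's'), ('t', 't'), ('u', 'z')]
```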
Once we have our final graph, but before we proceed to generate the machine code for it,
we can shrink the graph further by merging ranges together, which eliminates some edges:
/--------------\
| |
--->(0)--t-->(1)<--[a-ln-z]--/
|^-[a-lnz]-\
\----m-->((2))<----\
| |
\---m---/
Notice how [a-ln-su-z]+t becomes [a-ln-z].
Another optimization I've come up with (or probably re-invented, since it feels quite obvious)
is to read more than one character at a time, when we know for sure that multiple
characters in a row are expected, by concatenating edges that have only one parent and one child.
In our example, we know for sure at least two characters will be read for the regex t[a-z]*m,
so with this optimization enabled, we get this graph:
/--[a-ln-z]
| |
--->(0)---t[a-ln-z]--->(1)<---+--[a-ln-z]
| | /
| \---m--->((2))<------\
\--------------tm------------^ | |
\----m----/
This doesn't make much difference for a few characters,
but if we have a long pattern with a long sentence
that is repeated, we could e.g. read in 32 bytes
and compare them all in one operation,
if our machine had 256-bit SIMD registers/instructions.
This idea has also been implemented in the online demo.
There is a level which can be adjusted
from 0 to 32 to control how many bytes to merge at most,
located in the "[+]dfa5 = merge_linear(dfa4)" step.
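For what it's worth, the chain-merging step can be sketched like this (toy Python; merge_linear is my name here, echoing the demo's step, and a real implementation would have to protect start/accept states and treat cycles more carefully):

```python
# A state with exactly one incoming and one outgoing edge is a pure
# pass-through: it can be fused away, concatenating the two labels so a
# matcher could compare several characters in one operation.

def merge_linear(edges, protected):
    """edges: list of (src, label, dst); protected: states never fused away."""
    changed = True
    while changed:
        changed = False
        states = {s for (a, _, b) in edges for s in (a, b)}
        for q in states - protected:
            ins = [e for e in edges if e[2] == q]
            outs = [e for e in edges if e[0] == q]
            # fuse q only if it is a pure pass-through (and not a self-loop)
            if len(ins) == 1 and len(outs) == 1 and ins[0][0] != q != outs[0][2]:
                (a, l1, _), (_, l2, b) = ins[0], outs[0]
                edges = [e for e in edges if e not in (ins[0], outs[0])]
                edges.append((a, l1 + l2, b))
                changed = True
                break
    return edges

# A linear automaton for "abc" collapses to one three-character edge:
print(merge_linear([(0, 'a', 1), (1, 'b', 2), (2, 'c', 3)], {0, 3}))
# -> [(0, 'abc', 3)]
```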
Anyway, I can totally understand if you've had enough of regex optimizations for this time,
but if not, I wanted to share my work in this field, in case it's interesting to look at now or in the future.
/Joel
On Thu, Feb 18, 2021, at 20:58, Joel Jacobson wrote:
Like you said earlier, perhaps the regex engine has been optimized enough for this time.
If not, you want to investigate an additional idea,
In the above sentence, I meant "you _may_ want to".
I'm not at all sure these ideas are applicable to the PostgreSQL regex engine,
so feel free to silently ignore them if you feel there's a risk of wasting time.
that I think can be seen as a generalization of the optimization trick for (.*),
if I've understood how it works correctly.
Actually not sure if it can be seen as a generalization,
I just came to think of my ideas since they also improve
the case when you have lots of (.*) or bracket expressions of large ranges.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
Let's see if I can explain the idea:
One of the problems with representing regexes containing large bracket range expressions, like [a-z],
is that you get an explosion of edges if the model can only represent state transitions for single characters.
If we could instead let a single edge (for a state transition) represent a set of characters,
or, usually even more efficiently, a set of character ranges, then we could reduce the
number of edges needed to represent the graph.
The naive approach of just using the ranges as-is doesn't work.
Instead, the graph must first be created with single-character edges.
We then work out which ranges can be constructed such that no range overlaps
any other, so that every range can be treated as a character in an alphabet.
Hmm ... I might be misunderstanding, but I think our engine already
does a version of this. See the discussion of "colors" in
src/backend/regex/README.
Another optimization I've come up with (or probably re-invented, since it feels quite obvious)
is to read more than one character at a time, when we know for sure that multiple
characters in a row are expected, by concatenating edges that have only one parent and one child.
Maybe. In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea. That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.
I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing. So thanks again for collecting it!
regards, tom lane
On Thu, Feb 18, 2021, at 21:44, Tom Lane wrote:
Hmm ... I might be misunderstanding, but I think our engine already
does a version of this. See the discussion of "colors" in
src/backend/regex/README.
Thanks, I will read it with great interest.
Maybe. In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea. That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.
Right. Actually I don't have a clear idea of how it could be implemented in an NFA engine.
I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing. So thanks again for collecting it!
My pleasure. Thanks for using it!
/Joel
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
This popular regex
^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$
is coming from jQuery:
// A simple way to check for HTML strings
// Prioritize #id over <tag> to avoid XSS via location.hash (#9521)
// Strict HTML recognition (#11290: must start with <)
// Shortcut simple #id case for speed
rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$/,
From: https://code.jquery.com/jquery-3.5.1.js
I think this is a non-POSIX hack to match any character, including newlines,
which are not included unless the "s" flag is set.
Javascript test:
"foo\nbar".match(/(.+)/)[1];
"foo"
"foo\nbar".match(/(.+)/s)[1];
"foo
bar"
"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
I think this is a non-POSIX hack to match any character, including newlines,
which are not included unless the "s" flag is set.
"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"
Oooh, that's very interesting. I guess the advantage of that over using
the 's' flag is that you can have different behaviors at different places
in the same regex.
I was just wondering about this last night in fact, while hacking on
the code to get it to accept \W etc in bracket expressions. I see that
right now, our code thinks that NLSTOP mode ('n' switch, the opposite
of 's') should cause \W \D \S to not match newline. That seems a little
weird, not least because \S should probably be different from the other
two, and it isn't. And now we see it'd mean that you couldn't use the 'n'
switch to duplicate Javascript's default behavior in this area. Should we
change it? (I wonder what Perl does.)
regards, tom lane
On Fri, Feb 19, 2021, at 16:26, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
I think this is a non-POSIX hack to match any character, including newlines,
which are not included unless the "s" flag is set.
"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"
Oooh, that's very interesting. I guess the advantage of that over using
the 's' flag is that you can have different behaviors at different places
in the same regex.
I would guess the same thing.
I was just wondering about this last night in fact, while hacking on
the code to get it to accept \W etc in bracket expressions. I see that
right now, our code thinks that NLSTOP mode ('n' switch, the opposite
of 's') should cause \W \D \S to not match newline. That seems a little
weird, not least because \S should probably be different from the other
two, and it isn't. And now we see it'd mean that you couldn't use the 'n'
switch to duplicate Javascript's default behavior in this area. Should we
change it? (I wonder what Perl does.)
regards, tom lane
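Before the three-way comparison below, Javascript's own defaults can be sanity-checked directly in node or a browser console (a quick sketch): the "s" (dotAll) flag there only changes ".", leaving the negated classes alone.

```javascript
// In Javascript, 's' affects only '.'.  The negated classes keep plain
// set semantics: '\W' and '\D' match a newline (it is neither a word
// character nor a digit), while '\S' never does, since newline is
// whitespace.
console.log(/\W/.test("\n"));   // true
console.log(/\D/.test("\n"));   // true
console.log(/\S/.test("\n"));   // false
console.log(/./.test("\n"));    // false (true only with the 's' flag)
```

So Javascript's default already has \W and \D matching newline, which is exactly the behavior PostgreSQL's 'n' switch currently rules out.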
To allow comparing PostgreSQL vs Javascript vs Perl,
I installed three helper functions using plv8 and plperl,
and also one convenience function for PostgreSQL
that catches errors and returns the error string instead.
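(The helper functions themselves aren't included above. As a rough sketch, the core of the plv8 one presumably amounts to something like the following; the function name and shape are assumptions, not the actual definitions.)

```javascript
// Hypothetical core of a plv8 helper such as regexp_match_v8(str,
// pattern, flags).  String.prototype.match returns
// [full match, group1, ...] or null, which is why the queries below
// slice off the first array element.
function regexpMatchV8(str, pattern, flags) {
  return str.match(new RegExp(pattern, flags));
}

const r = regexpMatchV8("foo\nbar", "([\\w\\W]+)", "");
console.log(r.slice(1));  // ["foo\nbar"] -- capture groups only
```

The `[2:]` subscript in the SQL below corresponds to `slice(1)` here, since SQL arrays are 1-based; and the PostgreSQL convenience function presumably wraps regexp_match in an exception handler, which is why compile errors show up as strings in the result tables.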
The string used in this test is "foo!\n!bar",
which aims to detect differences in how newlines
and non-alphanumeric characters are handled.
To make the results comparable, the "n" flag is used for PostgreSQL
when no flags are used for Javascript/Perl, and no flag is used for
PostgreSQL when the "s" flag is used for Javascript/Perl.
In Javascript, when a regex contains capture groups, the entire match
is always returned as the first array element.
To make it easier to visually compare the results,
the first element is removed from the Javascript result,
which works in this test since all regexes contain
exactly one capture group.
Here are the results:
$ psql -e -f not_alnum.sql regex
SELECT
regexp_match_pg(E'foo!\n!bar', '(.+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(.+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(.+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo!} | {foo!} | {foo!}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(.+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(.+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(.+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{"foo! +| {"foo! +| {"foo! +
!bar"} | !bar"} | !bar"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w\W]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\w\W]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w\W]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"foo! +| {"foo! +
| !bar"} | !bar"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w\W]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\w\W]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w\W]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"foo! +| {"foo! +
| !bar"} | !bar"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\w]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\w]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\W]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\W]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\W]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"! +| {"! +
| !"} | !"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\W]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\W]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\W]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"! +| {"! +
| !"} | !"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\w+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(\w+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(\w+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\w+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(\w+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(\w+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\W+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(\W+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(\W+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{!} | {"! +| {"! +
| !"} | !"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\W+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(\W+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(\W+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{"! +| {"! +| {"! +
!"} | !"} | !"}
(1 row)
/Joel