Another regexp performance improvement: skip useless paren-captures
Here's a little finger exercise that improves a case that's bothered me
for a while. In a POSIX regexp, parentheses cause capturing by default;
you have to write the very non-obvious "(?:...)" if you don't want the
matching substring to be reported by the regexp engine. That'd be fine
if capturing were cheap, but with our engine it is not particularly
cheap. In many situations, the initial DFA check is sufficient to
tell whether there is an overall match, but it does not tell where any
subexpression match boundaries are. To identify exactly which substring
is deemed to match a parenthesized subexpression, we have to recursively
break down the match, which takes at the very least a few more DFA
invocations; and with an uncooperative regex, it can easily result in
O(N^2) behavior where there was none at the DFA stage.
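To make the capturing/non-capturing distinction concrete, here's how the two kinds of parens behave in an engine that exposes submatches — shown with Python's re module purely for illustration, not Postgres's engine:

```python
import re

# A capturing group obliges the engine to report submatch boundaries;
# a (?:...) group contributes only to the overall match.
cap = re.search(r'(\d+)-(\d+)', 'order 123-456')
print(cap.groups())       # ('123', '456')

noncap = re.search(r'(?:\d+)-(?:\d+)', 'order 123-456')
print(noncap.group(0))    # '123-456' -- same overall match
print(noncap.groups())    # ()        -- no submatches recorded
```

In Postgres's engine the second form is the cheap one, since no recursive match break-down is needed once the DFA says there's a match.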
Therefore, we really ought to expend some effort to not capture
subexpressions if the sub-match data is not actually needed, and in
many invocations we know it isn't. Spencer's original code has
a REG_NOSUB option that looks like it ought to be good for this ... but
on closer inspection it's basically useless, because it turns *all*
parens into non-capturing ones. That breaks back-references, so unless
you know that the regexp contains no back-refs, you can't use it.
The attached proposed patch redefines REG_NOSUB as being a regexp-
compile-time promise that the caller doesn't care about sub-match
locations, but not a promise that no backrefs exist. (If the
caller passes a match-locations array at execution anyway, it will
just get back -1 values, as if no sub-match had been identified.)
If that flag is passed, we run through the completed sub-regexp
tree and remove the "capture" markers on any subREs that are
not actually referenced by some backref. This typically causes
some parent subREs to no longer be deemed "messy", so that their
separate child subREs can be thrown away entirely, saving memory
space as well as runtime.
(I'd originally thought that a much more complex patch would be
needed to do this, because I assumed that re-optimizing the subRE
tree would be much more complicated than this. However, as far
as I can see this is sufficient; this change doesn't expose any
cases where additional tree restructuring would be helpful.)
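As a rough illustration of the idea (a hypothetical Python sketch, not the actual C code in regcomp.c — the SubRE class and helper names are made up): collect the group numbers that backrefs actually reference, then clear the capture marker on every other group:

```python
from dataclasses import dataclass, field

@dataclass
class SubRE:                      # hypothetical stand-in for the engine's subre
    capno: int = 0                # 0 = not a capturing group
    backrefno: int = 0            # 0 = not a backref; else group it references
    children: list = field(default_factory=list)

def referenced_groups(t, acc=None):
    # Gather every group number that some backref actually uses.
    acc = set() if acc is None else acc
    if t.backrefno:
        acc.add(t.backrefno)
    for c in t.children:
        referenced_groups(c, acc)
    return acc

def remove_useless_captures(t, needed):
    # Under a REG_NOSUB-style promise, drop capture markers nobody needs.
    if t.capno and t.capno not in needed:
        t.capno = 0
    for c in t.children:
        remove_useless_captures(c, needed)

# Shape of '(a)(b)\2': group 1 is never referenced, group 2 is.
tree = SubRE(children=[SubRE(capno=1), SubRE(capno=2), SubRE(backrefno=2)])
remove_useless_captures(tree, referenced_groups(tree))
print([c.capno for c in tree.children])   # [0, 2, 0]
```

Note that group 2 keeps its original number even though group 1's marker is gone, which is why backref numbering survives the pass.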
Testing with Joel's handy little corpus of web regexps, there's a
useful improvement in the speed of ~ operators (a/k/a regexp_like()).
I see the total time to apply regexp_like() to all 4474520 entries
dropping from 10:17 to 5:46. Interesting statistics include
regexp=# select max(duration),avg(duration) from headresults;
       max       |       avg
-----------------+-----------------
 00:00:00.939389 | 00:00:00.000138
(1 row)
regexp=# select max(duration),avg(duration) from patchresults;
       max       |       avg
-----------------+-----------------
 00:00:00.918549 | 00:00:00.000077
(1 row)
The lower percentiles don't move much, but upper ones do:
regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from headresults;
percentile_cont
-------------------------------------------------------------------
{00:00:00.000027,00:00:00.000059,00:00:00.000067,00:00:00.000108}
(1 row)
regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from patchresults;
percentile_cont
-------------------------------------------------------------------
{00:00:00.000025,00:00:00.000042,00:00:00.000048,00:00:00.000065}
(1 row)
This isn't terribly surprising, because regexps that were already
really cheap probably have no capturing parens to dispense with.
Of course, there's no benefit with functions that do need sub-match
data, such as regexp_match. But the added overhead in such cases
should be quite negligible. The only downside I can see is that
if you use the "same" regexp in both submatches-needed and
non-submatches-needed contexts, you'll end up with two separate
compiled regexp cache entries. That doesn't seem like a big
problem though.
regards, tom lane
Attachments:
optimize-useless-captures-1.patch (text/x-diff, +132/-40)
On 8/4/21 6:15 PM, Tom Lane wrote:
Here's a little finger exercise that improves a case that's bothered me
for awhile. In a POSIX regexp, parentheses cause capturing by default;
you have to write the very non-obvious "(?:...)" if you don't want the
matching substring to be reported by the regexp engine.
It's not obscure to perl programmers :-)
That'd be fine
if capturing were cheap, but with our engine it is not particularly
cheap. In many situations, the initial DFA check is sufficient to
tell whether there is an overall match, but it does not tell where any
subexpression match boundaries are. To identify exactly which substring
is deemed to match a parenthesized subexpression, we have to recursively
break down the match, which takes at the very least a few more DFA
invocations; and with an uncooperative regex, it can easily result in
O(N^2) behavior where there was none at the DFA stage.
Therefore, we really ought to expend some effort to not capture
subexpressions if the sub-match data is not actually needed, which in
many invocations we know that it isn't. Spencer's original code has
a REG_NOSUB option that looks like it ought to be good for this ... but
on closer inspection it's basically useless, because it turns *all*
parens into non-capturing ones. That breaks back-references, so unless
you know that the regexp contains no back-refs, you can't use it.
In perl you can use the 'n' modifier for this effect (since 5.22)
I would expect to know if a back-ref were present.
I'm a bit worried about how you'll keep track of back-ref numbering
since back-refs only count capturing groups, and you're silently turning
a capturing group into a non-capturing group.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
I'm a bit worried about how you'll keep track of back-ref numbering
since back-refs only count capturing groups, and you're silently turning
a capturing group into a non-capturing group.
They're already numbered at this point, and we aren't changing the numbers
of the capturing groups that remain live. There will be unused entries in
the regmatch_t array at runtime (corresponding to the zapped groups), but
that doesn't cost anything worth mentioning.
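A loose Python analogue of the "unused regmatch_t entries" point (illustration only — Postgres reports -1 offsets rather than None): a group that never participates reports nothing, while later groups keep their original numbers.

```python
import re

# Group 1 exists in the pattern but does not participate in this match;
# group 2 is still group 2 -- the numbering does not shift.
m = re.search(r'(a)?(b)', 'b')
print(m.group(1))   # None
print(m.group(2))   # 'b'
```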
Now that you mention it, I am not sure whether there are any regression
test cases that specifically cover still being able to match \2 when
the first capture group went away. Probably should add more cases...
regards, tom lane
On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 8/4/21 6:15 PM, Tom Lane wrote:
Here's a little finger exercise that improves a case that's bothered me
for awhile. In a POSIX regexp, parentheses cause capturing by default;
you have to write the very non-obvious "(?:...)" if you don't want the
matching substring to be reported by the regexp engine.
It's not obscure to perl programmers :-)
Well, I consider myself a pretty fair perl programmer, and I know
there's a way to do that, but I never do it, and I would have had to
look up the exact syntax. So +1 from me for anything automatic that
avoids paying the overhead in some cases.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
Well, I consider myself a pretty fair perl programmer, and I know
there's a way to do that, but I never do it, and I would have had to
look up the exact syntax. So +1 from me for anything automatic that
avoids paying the overhead in some cases.
That's my feeling about it too --- I never really think of this
point when writing a regexp. It seems like something the engine
ought to handle gracefully, so this patch is an attempt to do so.
regards, tom lane
On 8/5/21 10:39 AM, Robert Haas wrote:
On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 8/4/21 6:15 PM, Tom Lane wrote:
Here's a little finger exercise that improves a case that's bothered me
for awhile. In a POSIX regexp, parentheses cause capturing by default;
you have to write the very non-obvious "(?:...)" if you don't want the
matching substring to be reported by the regexp engine.
It's not obscure to perl programmers :-)
Well, I consider myself a pretty fair perl programmer,
I also consider you one :-)
Perhaps I should have said "many perl programmers".
and I know
there's a way to do that, but I never do it, and I would have had to
look up the exact syntax. So +1 from me for anything automatic that
avoids paying the overhead in some cases.
Yeah, I'm not arguing against the idea. I also have to look it up,
mainly because there is such a huge amount of stuff that can follow
"(?", so "perldoc perlre" happens a lot when I'm doing that sort of work.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Aug 5, 2021, at 7:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Probably should add more cases...
The patch triggers an assertion that master does not:
+select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))';
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
+connection to server was lost
The relevant portion of the stack trace:
frame #3: 0x00000001043bcf6d postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>, fileName=<unavailable>, lineNumber=<unavailable>) at assert.c:69:2 [opt]
frame #4: 0x000000010410168b postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd055b0, begin=0x00007f863d821528, end=0x00007f863d82152c) at regexec.c:767:4 [opt]
frame #5: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821524, end=<unavailable>) at regexec.c:835:10 [opt]
frame #6: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd05430, begin=0x00007f863d821524, end=0x00007f863d82152c) at regexec.c:752 [opt]
frame #7: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821520, end=<unavailable>) at regexec.c:835:10 [opt]
frame #8: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd050f0, begin=0x00007f863d821520, end=0x00007f863d82152c) at regexec.c:752 [opt]
frame #9: 0x0000000104101282 postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821520, end=<unavailable>) at regexec.c:832:9 [opt]
frame #10: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd04ff0, begin=0x00007f863d821520, end=0x00007f863d821530) at regexec.c:752 [opt]
frame #11: 0x00000001040ff508 postgres`pg_regexec [inlined] cfindloop(v=<unavailable>, cnfa=<unavailable>, cm=<unavailable>, d=0x00007ffeebdd6d68, s=0x00007ffeebdd2b48, coldp=<unavailable>) at regexec.c:600:10 [opt]
frame #12: 0x00000001040ff36b postgres`pg_regexec [inlined] cfind(v=0x000000010459c5f8, cnfa=<unavailable>, cm=<unavailable>) at regexec.c:515 [opt]
frame #13: 0x00000001040ff315 postgres`pg_regexec(re=<unavailable>, string=<unavailable>, len=140732855577960, search_start=<unavailable>, details=<unavailable>, nmatch=0, pmatch=0x0000000000000000, flags=0) at regexec.c:293 [opt]
frame #14: 0x0000000104244d61 postgres`RE_wchar_execute(re=<unavailable>, data=<unavailable>, data_len=<unavailable>, start_search=<unavailable>, nmatch=<unavailable>, pmatch=<unavailable>) at regexp.c:274:19 [opt]
frame #15: 0x0000000104242c80 postgres`textregexne [inlined] RE_execute(dat=<unavailable>, dat_len=31, nmatch=0, pmatch=0x0000000000000000) at regexp.c:322:10 [opt]
frame #16: 0x0000000104242c50 postgres`textregexne [inlined] RE_compile_and_execute(text_re=<unavailable>, dat=<unavailable>, dat_len=31, cflags=19, collation=<unavailable>, nmatch=0, pmatch=<unavailable>) at regexp.c:357 [opt]
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
The patch triggers an assertion that master does not:
+select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))';
On looking into this, it's pretty simple: regexec.c has an assertion
that a pure-capture subre node ought to be doing some capturing.
case '(': /* no-op capture node */
assert(t->child != NULL);
assert(t->capno > 0);
That's fine as of HEAD, but with the proposed patch, we may notice
that the node isn't actually referenced by any backref, and remove
its capture marker, allowing this assertion to fire. Nothing's
really wrong though.
There seem to be three things we could do about that:
1. Extend removecaptures() so that it can actually remove no-op
capture nodes if it's removed their capture markings. This would
substantially complicate that function, and I judge that it's not
worth the trouble. We'll only have such nodes in cases of
capturing parentheses immediately surrounding capturing parentheses,
which doesn't seem like a case worth expending sweat for.
2. Just drop the "t->capno > 0" assertion in regexec.c.
3. Weaken said assertion, perhaps by also checking the BRUSE flag bit.
Not sure that I see any point to #3, so I just dropped the
assertion in the attached.
I've also rebased over the bug fixes from the other thread,
and added a couple more test cases.
regards, tom lane
Attachments:
optimize-useless-captures-2.patch (text/x-diff, +143/-41)
On Aug 8, 2021, at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I've also rebased over the bug fixes from the other thread,
and added a couple more test cases.
regards, tom lane
Hmm. This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2):
select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
regexp_split_to_array
-----------------------
- {"",zkodphfbfbfb}
+ {uuuzkodphfbfbfb}
(1 row)
The string starts with three "u" characters. The first of them is doubly-matched, meaning \1 and \2 refer to the first "u" character. The (\1\2) that follows matches the next two "u" characters. When the extra "useless" capture group is skipped, apparently this doesn't work anymore. I haven't looked at your patch, so I'm not sure why, but I'm guessing that \2 doesn't refer to anything.
That analysis is consistent with the next change:
select regexp_split_to_array('snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe', '((((?:.))))\3');
- regexp_split_to_array
----------------------------------------------------------------------
- {snfwbvx,snzqabixqbixqiumpgxde,xvnsemjxgqoqknrqe,mcqmfslfspskqpqxe}
+ regexp_split_to_array
+------------------------------------------------------------------------
+ {snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe}
(1 row)
The pattern matches any double character. I would expect it to match the "ee", the "mm" and the "ss" in the text. With the patched code, it matches nothing.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
Hmm. This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2):
select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
 regexp_split_to_array
-----------------------
- {"",zkodphfbfbfb}
+ {uuuzkodphfbfbfb}
(1 row)
Ugh. The regex engine is finding the match correctly, but it's failing to
tell the caller where it is :-(. I was a little too cute in optimizing
the regmatch_t result-vector copying in pg_regexec, and forgot to ensure
that the overall match position would be reported.
Thanks for the testing!
regards, tom lane
Attachments:
optimize-useless-captures-3.patch (text/x-diff, +150/-41)
On Aug 8, 2021, at 1:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ugh. The regex engine is finding the match correctly, but it's failing to
tell the caller where it is :-(. I was a little too cute in optimizing
the regmatch_t result-vector copying in pg_regexec, and forgot to ensure
that the overall match position would be reported.
Thanks for the testing!
Sure! Thanks for improving the regular expression engine!
I have applied your latest patch and do not see any problems with it. All my tests pass with no asserts and with no differences in results vs. master. This is a test suite of nearly 1.5 million separate regular expressions.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
I have applied your latest patch and do not see any problems with it. All my tests pass with no asserts and with no differences in results vs. master. This is a test suite of nearly 1.5 million separate regular expressions.
Cool, thanks. I also tried your millions-of-random-regexps script
and didn't find any difference between the results from HEAD and
those from the v3 patch.
regards, tom lane
On Aug 8, 2021, at 3:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Cool, thanks. I also tried your millions-of-random-regexps script
and didn't find any difference between the results from HEAD and
those from the v3 patch.
The patch looks ready to commit. I don't expect to test it any further unless you have something in particular you'd like me to focus on.
Thanks again for working on this.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
The patch looks ready to commit. I don't expect to test it any further unless you have something in particular you'd like me to focus on.
Pushed, but while re-reading it before commit I noticed that there's
some more fairly low-hanging fruit in regexp_replace(). As I had it
in that patch, it never used REG_NOSUB because of the possibility
that the replacement string uses "\N". However, we're already
pre-checking the replacement string to see if it has backslashes
at all, so while we're at it we can check for \N to discover if we
actually need any subexpression match data or not. We do need to
refactor a little to postpone calling pg_regcomp until after we
know that, but I think that makes replace_text_regexp's API less
ugly not more so.
While I was at it, I changed the search-for-backslash loops to
use memchr rather than handwritten looping. Their use of
pg_mblen was pretty unnecessary given we only need to find
backslashes, and we can assume the backend encoding is ASCII-safe.
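A minimal sketch of that pre-check, in Python for illustration (the real code is C in the backend, and the function name here is made up): scan the replacement string once for backslash escapes, and report whether any \1..\9 reference appears. Only then is submatch data actually needed — \& needs just the overall match, and \\ is a literal backslash.

```python
def replacement_needs_submatches(repl: str) -> bool:
    i = repl.find('\\')                 # memchr-style search for a backslash
    while i != -1 and i + 1 < len(repl):
        if repl[i + 1] in '123456789':  # a \N submatch reference
            return True
        i = repl.find('\\', i + 2)      # skip past the escaped character
    return False

print(replacement_needs_submatches('plain text'))   # False
print(replacement_needs_submatches(r'x\1y'))        # True
print(replacement_needs_submatches(r'a\\b'))        # False: literal backslash
```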
Using a bunch of random cases generated by your little perl
script, I see maybe 10-15% speedup on test cases that don't
use \N in the replacement string, while it's about a wash
on cases that do. (If I'd been using a multibyte encoding,
maybe the memchr change would have made a difference, but
I didn't try that.)
regards, tom lane
Attachments:
let-regexp_replace-use-NOSUB.patch (text/x-diff, +64/-63)
Tom,
I can still trigger the old bug for which we thought we'd pushed a fix. The test case below crashes on master (e12694523e7e4482a052236f12d3d8b58be9a22c), and also on the fixed version "Make regexp engine's backref-related compilation state more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a).
Can you test if it crashes for you, too? I'm not sure I see why this one fails when millions of others pass.
The backtrace is still complaining about regc_nfa.c:1265:
+select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
+connection to server was lost
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
I can still trigger the old bug for which we thought we'd pushed a fix. The test case below crashes on master (e12694523e7e4482a052236f12d3d8b58be9a22c), and also on the fixed version "Make regexp engine's backref-related compilation state more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a).
Can you test if it crashes for you, too? I'm not sure I see why this one fails when millions of others pass.
The backtrace is still complaining about regc_nfa.c:1265:
+select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
+server closed the connection unexpectedly
Hmmm ... yeah, I see it too. This points up something I'd wondered
about before, which is whether the code that "cancels everything"
after detecting {0} is really OK. It throws away the outer subre
*and children* without worrying about what might be inside, and
here we see that that's not good enough --- there's still a v->subs
pointer to the first capturing paren set, which we just deleted,
so that the \1 later on messes up. I'm not sure why the back
branches are managing not to crash, but that might just be a memory
management artifact.
regards, tom lane
I wrote:
Hmmm ... yeah, I see it too. This points up something I'd wondered
about before, which is whether the code that "cancels everything"
after detecting {0} is really OK. It throws away the outer subre
*and children* without worrying about what might be inside, and
here we see that that's not good enough --- there's still a v->subs
pointer to the first capturing paren set, which we just deleted,
so that the \1 later on messes up. I'm not sure why the back
branches are managing not to crash, but that might just be a memory
management artifact.
... yeah, it is. For me, this variant hits the assertion in all
branches:
regression=# select regexp_split_to_array('', '((.)){0}(\2){0}');
server closed the connection unexpectedly
So that's a pre-existing (and very long-standing) bug. I'm not
sure if it has any serious impact in non-assert builds though.
Failure to clean out some disconnected arcs probably has no
real effect on the regex's behavior later.
regards, tom lane
Mark Dilger <mark.dilger@enterprisedb.com> writes:
+select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
+server closed the connection unexpectedly
Here's a quick draft patch for this. Basically it moves the
responsibility for clearing v->subs[] pointers into the freesubre()
recursion, so that it will happen for contained capturing parens
not only the top level.
There is a potentially interesting definitional question:
what exactly ought this regexp do?
((.)){0}\2
Because the capturing paren sets are zero-quantified, they will
never be matched to any characters, so the backref can never
have any defined referent. I suspect that study of the POSIX
spec would lead to the conclusion that this is a legal regexp
but it will never match anything. Implementing that would be
tedious though, and what's more it seems very unlikely that
the user wanted any such behavior. So I think throwing an
error is an appropriate response. The existing code will
throw such an error for
((.)){0}\1
so I guess Spencer did think about this to some extent -- he
just forgot about the possibility of nested parens.
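For comparison, here's how another engine treats the same construct (Python's re, not the POSIX engine under discussion — shown only as a data point): the backref to a zero-quantified group compiles fine but can never be satisfied, so the pattern is legal yet never matches.

```python
import re

# The {0} group never participates, so \2 can never be satisfied.
never = re.search(r'((.)){0}\2', 'abc')
print(never)                              # None

# With {1} the group participates and the backref works normally.
doubled = re.search(r'((.)){1}\2', 'aac')
print(doubled.group(0))                   # 'aa'
```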
This patch should work OK in HEAD and v14, but it will need
a bit of fooling-about for older branches I think, given that
they fill v->subs[] a little differently.
regards, tom lane
Attachments:
fix-zero-quantified-nested-parens.patch (text/x-diff, +26/-4)
On Aug 9, 2021, at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pushed, but while re-reading it before commit I noticed that there's
some more fairly low-hanging fruit in regexp_replace(). As I had it
in that patch, it never used REG_NOSUB because of the possibility
that the replacement string uses "\N". However, we're already
pre-checking the replacement string to see if it has backslashes
at all, so while we're at it we can check for \N to discover if we
actually need any subexpression match data or not. We do need to
refactor a little to postpone calling pg_regcomp until after we
know that, but I think that makes replace_text_regexp's API less
ugly not more so.
While I was at it, I changed the search-for-backslash loops to
use memchr rather than handwritten looping. Their use of
pg_mblen was pretty unnecessary given we only need to find
backslashes, and we can assume the backend encoding is ASCII-safe.
Using a bunch of random cases generated by your little perl
script, I see maybe 10-15% speedup on test cases that don't
use \N in the replacement string, while it's about a wash
on cases that do. (If I'd been using a multibyte encoding,
maybe the memchr change would have made a difference, but
I didn't try that.)
I've been reviewing and testing this (let-regexp_replace-use-NOSUB.patch) since you sent it 4 hours ago, and I can't seem to break it. There are pre-existing problems in the regex code, but this doesn't seem to add any new breakage.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There is a potentially interesting definitional question:
what exactly ought this regexp do?
((.)){0}\2
Because the capturing paren sets are zero-quantified, they will
never be matched to any characters, so the backref can never
have any defined referent.
Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we should be clear what that is:
#!/usr/bin/perl
use strict;
use warnings;
our $match;
if ('foo' =~ m/((.)(??{ die; })){0}(..)/)
{
print "captured 1 $1\n" if defined $1;
print "captured 2 $2\n" if defined $2;
print "captured 3 $3\n" if defined $3;
print "captured 4 $4\n" if defined $4;
print "match = $match\n" if defined $match;
}
This will print "captured 3 fo", proving that although the regular expression is parsed with the (..) bound to the third capture group, the first two capture groups never run. If you don't believe that, change the {0} to {1} and observe that the script dies.
So I think throwing an
error is an appropriate response. The existing code will
throw such an error for
((.)){0}\1
so I guess Spencer did think about this to some extent -- he
just forgot about the possibility of nested parens.
Ugh. That means our code throws an error where perl does not, pretty well negating my point above. If we're already throwing an error for this type of thing, I agree we should be consistent about it. My personal preference would have been to do the same thing as perl, but it seems that ship has already sailed.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company