Pathological regexp match

Started by Michael Glaesemannover 16 years ago15 messageshackers

Jump to latest

Michael Glaesemann

grzm@seespotcode.net

over 16 years ago

We came across a regexp that takes very much longer than expected.

PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit

SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity

?column?
----------
t
(1 row)

Time: 90525.148 ms

The full query is below.

The same match in perl takes less than 0.01 seconds on the same hardware.

#!/bin/env perl
use warnings;
use strict;

my $sample = 'ooo...'; # omitted for email brevity

if ($sample =~ /Z(Q)[^Q]*A.*?(\1)/) {
print 'matches';
}
else {
print 'does not match';
}

This is a simplified version of a match that finally finished after 18 hours.

Given the nearly 4 orders of magnitude difference between the Perl script and the Postgres version, is there something that could be improved in the Postgres regex engine?

Cheers,

Michael Glaesemann
michael.glaesemann@myyearbook.com

SELECT 'ooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooooooooooooooooooQoooooooooooooooooooooooooooooooooooQoooooooQooooooQooooZQoooooooooooAooooQoooooooQooooooooooooooooooooooQoooooooQooooooooooooooooooooooQoooooQoooooooooooAooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooooooooooooQoQooooooQoooooooQoooooooQoooooooQooooooooooooooooooooooooooooooooooooooooooZQoooooooooooooooooQooQoQoooooooQooooooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooZQoooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQooooooooooooooooooooooooooooooZQoooooooooooooQooQoQooooooooooQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooQoooooooooooooooooooooooooQoooooooQooooooooooooooooooooooooooooooooooooooQooooooooQoQoooooooooooooooooooooooooooooZQooooooooooooQooQoQooooooooooooooooooooooooooooooooooooooooooooooooooooooooZQooooooooooooooooooooooooooooooQooQoQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooZQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooQooooooooooQoooooooQoooooooooooQoooooooooooooo' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$;

Alvaro Herrera

alvherre@2ndquadrant.com

over 16 years ago

In reply to: Michael Glaesemann (#1)

Re: Pathological regexp match

Hi Michael,

Michael Glaesemann wrote:

We came across a regexp that takes very much longer than expected.

PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit

SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity

The ? after .* is pointless. If you remove it, the query returns
immediately.

(There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Michael Glaesemann

grzm@seespotcode.net

over 16 years ago

In reply to: Alvaro Herrera (#2)

Re: Pathological regexp match

On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote:

Hi Michael,

Michael Glaesemann wrote:

We came across a regexp that takes very much longer than expected.

PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit

SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email
brevity

The ? after .* is pointless.

Interesting. I would expect that *? would be the non-greedy version of
*, meaning match up to the first \1 (in this case the first Q
following A), rather than as much as possible.

For example, in Perl:
$ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*(\1)/)
{ print \$&; } else { print 'NO'; }" && echo
ZQoooAoooQooQooQ
$ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*?(\1)/)
{ print \$&; } else { print 'NO'; }" && echo
ZQoooAoooQ

If I'm reading the docs right, Postgres does support non-greedy * as *?:

<http://www.postgresql.org/docs/8.4/interactive/functions-matching.html#POSIX-QUANTIFIERS-TABLE

However, as you point out, Postgres doesn't appear to take this into
account:

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*?(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

Michael Glaesemann
michael.glaesemann@myyearbook.com

Alvaro Herrera

alvherre@2ndquadrant.com

over 16 years ago

In reply to: Michael Glaesemann (#3)

Re: Pathological regexp match

Michael Glaesemann wrote:

On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote:

Hi Michael,

Michael Glaesemann wrote:

We came across a regexp that takes very much longer than expected.

PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit

SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email
brevity

The ? after .* is pointless.

Interesting. I would expect that *? would be the non-greedy version
of *, meaning match up to the first \1 (in this case the first Q
following A), rather than as much as possible.

Huh, you are right, *? is the non-greedy version. I keep forgetting
those. Note that they only work if you have regex_flavor set to
advanced, though (which is the default).

However, as you point out, Postgres doesn't appear to take this into
account:

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*?(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

Hmm, that's strange ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Alvaro Herrera

alvherre@2ndquadrant.com

over 16 years ago

In reply to: Michael Glaesemann (#3)

Re: Pathological regexp match

Michael Glaesemann wrote:

However, as you point out, Postgres doesn't appear to take this into
account:

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)
[^Q]*A.*?(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooo
(1 row)

I think the reason for this is that the first * is greedy and thus the
entire expression is considered greedy. The fact that you've made the
second * non-greedy does not ungreedify the RE ... Note the docs say:

The above rules associate greediness attributes not only with
individual quantified atoms, but with branches and entire REs
that contain quantified atoms. What that means is that the
matching is done in such a way that the branch, or whole RE,
matches the longest or shortest possible substring as a whole.

It's late here so I'm not sure if this is what you're looking for:

alvherre=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)[^Q]*?A.*(\2))$r$, $s$X$s$);
regexp_replace
----------------
oooXooQooQooo
(1 fila)

(Obviously the non-greediness has moved somewhere else) :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Michael Glaesemann

grzm@seespotcode.net

over 16 years ago

In reply to: Alvaro Herrera (#5)

Re: Pathological regexp match

On Jan 28, 2010, at 23:21 , Alvaro Herrera wrote:

I think the reason for this is that the first * is greedy and thus the
entire expression is considered greedy. The fact that you've made the
second * non-greedy does not ungreedify the RE ... Note the docs say:

The above rules associate greediness attributes not only with
individual quantified atoms, but with branches and entire REs
that contain quantified atoms. What that means is that the
matching is done in such a way that the branch, or whole RE,
matches the longest or shortest possible substring as a whole.

Interesting. Thanks for pointing out this section of the docs. I
wasn't aware of this twist.

It's late here so I'm not sure if this is what you're looking for:

I'm not actually looking for a regexp that works: I was able to
accomplish the task I had at hand with a different regexp. I'm just
reporting the particular unexpected nastiness we ran into. :)

Michael Glaesemann
michael.glaesemann@myyearbook.com

Magnus Hagander

magnus@hagander.net

over 16 years ago

In reply to: Alvaro Herrera (#2)

Re: Pathological regexp match

2010/1/29 Alvaro Herrera <alvherre@commandprompt.com>:

Hi Michael,

Michael Glaesemann wrote:

We came across a regexp that takes very much longer than expected.

PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit

SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity

The ? after .* is pointless. If you remove it, the query returns
immediately.

(There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)

Incidentally, I ran across the exact same issue with a non-greedy
regexp with a client earlier this week, and put on my TODO to figure
out a good place to stick a check for interrupts. Does this mean I
don't have to, because you're on it? ;)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Alvaro Herrera

alvherre@2ndquadrant.com

over 16 years ago

In reply to: Magnus Hagander (#7)

Re: Pathological regexp match

Magnus Hagander wrote:

2010/1/29 Alvaro Herrera <alvherre@commandprompt.com>:

(There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW)

Incidentally, I ran across the exact same issue with a non-greedy
regexp with a client earlier this week, and put on my TODO to figure
out a good place to stick a check for interrupts. Does this mean I
don't have to, because you're on it? ;)

No, sorry :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Tom Lane

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Michael Glaesemann (#1)

Re: Pathological regexp match

Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:

We came across a regexp that takes very much longer than expected.

I poked into this a little bit. What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion. In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime. With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second. This brings the
runtime down from minutes to sub-second on my machine. However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them. Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

regards, tom lane

#10

Magnus Hagander

magnus@hagander.net

over 16 years ago

In reply to: Tom Lane (#9)

Re: Pathological regexp match

On Sat, Jan 30, 2010 at 08:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:

We came across a regexp that takes very much longer than expected.

I poked into this a little bit. What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion. In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime. With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second. This brings the
runtime down from minutes to sub-second on my machine. However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

I only have the one, so I don't think I can help all that much with that.

BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them. Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

Yeah. I have zero experience around that code, so if oyu have at least
some, I'd much appreciate it if you (or someone who does) could look
at it. Likely to cause a lot less breakage than me :D

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#11

Tom Lane

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Tom Lane (#9)

Re: Pathological regexp match

I wrote:

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second. This brings the
runtime down from minutes to sub-second on my machine. However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

The Tcl folk accepted that patch, so I went ahead and applied it to
our code. It would still be a good idea for us to do any testing we
can on it, though.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

I looked at this a bit and decided that it would be messier than seems
justified in the absence of known problems. The regex code uses malloc
not palloc for transient data structures, so we can't just stick a
CHECK_FOR_INTERRUPTS() into some inner loop hotspot --- throwing a
longjmp would mean a permanent memory leak. I looked at integrating
into the error mechanism it does have, but it turns out that that's
not particularly designed to force immediate exit on failure either;
they just set a flag bit that will be tested whenever control does
return from the match. I think that a good fix would require first
changing the mallocs to pallocs, then add CHECK_FOR_INTERRUPTS.
But that's not something I have time to mess with at the moment.

regards, tom lane

#12

Michael Glaesemann

grzm@seespotcode.net

over 16 years ago

In reply to: Tom Lane (#11)

Re: Pathological regexp match

On Jan 31, 2010, at 22:14 , Tom Lane wrote:

The Tcl folk accepted that patch, so I went ahead and applied it to
our code. It would still be a good idea for us to do any testing we
can on it, though.

I applied the patch and ran both the test query I submitted as well as
original problematic query that triggered the report, and it runs much
faster. Thanks for the fix!

Michael Glaesemann
michael.glaesemann@myyearbook.com

#13

Magnus Hagander

magnus@hagander.net

over 16 years ago

In reply to: Michael Glaesemann (#12)

Re: Pathological regexp match

2010/2/1 Michael Glaesemann <michael.glaesemann@myyearbook.com>:

On Jan 31, 2010, at 22:14 , Tom Lane wrote:

The Tcl folk accepted that patch, so I went ahead and applied it to
our code. It would still be a good idea for us to do any testing we
can on it, though.

I applied the patch and ran both the test query I submitted as well as original problematic query that triggered the report, and it runs much faster. Thanks for the fix!

I did the same, and it does not help in my case. FWIW, the regexp I'm
matching is:
<pre .*?>(.*?)</pre>

(yes, the production system has already been fixed to use a smarter
regexp that solves the same problem)

The text is about 180Kb. PostgreSQL takes ~40 seconds without the
patch, ~36 seconds with it, to extract the match from it. Perl takes
0.016 seconds.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#14

David E. Wheeler

david@kineticode.com

over 16 years ago

In reply to: Magnus Hagander (#13)

Re: Pathological regexp match

On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote:

The text is about 180Kb. PostgreSQL takes ~40 seconds without the
patch, ~36 seconds with it, to extract the match from it. Perl takes
0.016 seconds.

Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P

Best,

David

#15

Joachim Wieland

joe@mcknight.de

over 16 years ago

In reply to: David E. Wheeler (#14)

Re: Pathological regexp match

On Mon, Feb 8, 2010 at 6:07 PM, David E. Wheeler <david@kineticode.com> wrote:

On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote:

The text is about 180Kb. PostgreSQL takes ~40 seconds without the
patch, ~36 seconds with it, to extract the match from it. Perl takes
0.016 seconds.

Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P

FWIW, PCRE is BSD licensed, so we could actually include it... Would
be interesting to see how it performs on the pattern at hand.

Joachim