POSIX regex performance bug in 7.3 Vs. 7.2

Started by wadeabout 23 years ago40 messageshackers
Jump to latest
#1wade
wade@wavefire.com

Hello,
We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
the cool new features in 7.3. The installed version is CVS stable from
yesterday. However, we noticed a major performance hit in POSIX regular
expression matches against columns using the ~* operator.
http://arch.wavefire.com/badregex73.txt has explain analyze output from 7.2
and 7.3.1 respectively. Both of these tables have only 101 rows. The
7.3.1 install is using the default settings from postgresql.conf.

Any ideas as to why this should be happening?

Should anyone require additional information, please let me know.

-Wade Klaver

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: wade (#1)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

wade <wade@wavefire.com> writes:

We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
the cool new features in 7.3. The installed version is CVS stable from
yesterday. However, we noticed a major performance hit in POSIX regular
expression matches against columns using the ~* operator.

The only thought that comes to mind is that multibyte character sets are
supported in 7.3 whereas they were optional (and not default) in 7.2.
I'd not have expected a factor-of-150 performance hit from that, though.

Could you rebuild your backend with profiling enabled and get a gprof
summary of where the time is going?

regards, tom lane

#3Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: wade (#1)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Why on earth are you using a CVS version!?!?!?!

Chris

On Fri, 31 Jan 2003, wade wrote:

Show quoted text

Hello,
We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
the cool new features in 7.3. The installed version is CVS stable from
yesterday. However, we noticed a major performance hit in POSIX regular
expression matches against columns using the ~* operator.
http://arch.wavefire.com/badregex73.txt has explain analyze output from 7.2
and 7.3.1 respectively. Both of these tables have only 101 rows. The
7.3.1 install is using the default settings from postgresql.conf.

Any ideas as to why this should be happening?

Should anyone require additional information, please let me know.

-Wade Klaver

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#3)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

Why on earth are you using a CVS version!?!?!?!

I assume he meant tip of REL7_3 branch --- which is a perfectly
reasonable thing to install, even if there are still a few fixes
to go before we call it 7.3.2.

regards, tom lane

#5wade
wade@wavefire.com
In reply to: Tom Lane (#4)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

At 08:31 PM 2/1/03 +0800, Christopher Kings-Lynne wrote:

Why on earth are you using a CVS version!?!?!?!

Chris

This problem manifests itself under 7.3.1 release as well. CVS is used so
we can access patches to the SRF stuff implemented after 7.3.1 was released.

Tom... any links that document the procedure for obtaining the profile
information you requested? I've used a profiler briefly, but am not sure
how to go about gathering information pertainent to this problematic query
only.

-Wade Klaver.

#6wade
wade@wavefire.com
In reply to: wade (#5)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

At 10:52 PM 1/31/03 -0500, Tom Lane wrote:

wade <wade@wavefire.com> writes:

We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
the cool new features in 7.3. The installed version is CVS stable from
yesterday. However, we noticed a major performance hit in POSIX regular
expression matches against columns using the ~* operator.

The only thought that comes to mind is that multibyte character sets are
supported in 7.3 whereas they were optional (and not default) in 7.2.
I'd not have expected a factor-of-150 performance hit from that, though.

Could you rebuild your backend with profiling enabled and get a gprof
summary of where the time is going?

regards, tom lane

Here is the profile information. I included a log of the session that
generated it at the top of the gprof output. If there is any other info I
can help you with, please let me know.
http://arch.wavefire.com/pgregexgmon.txt

-Wade Klaver

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: wade (#6)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

wade <wade@wavefire.com> writes:

Here is the profile information. I included a log of the session that
generated it at the top of the gprof output. If there is any other info I
can help you with, please let me know.

A four-second test isn't long enough to gather any statistically
meaningful profile info. On most machines, gprof samples 100 times per
second, so realistically you need a minute or two of runtime to have
trustworthy numbers.

Please replicate the rows in the table by a factor of ten or twenty or
so and try again.

regards, tom lane

#8wade
wade@wavefire.com
In reply to: Tom Lane (#7)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

At 05:51 PM 2/3/03 -0500, Tom Lane wrote:

wade <wade@wavefire.com> writes:

Here is the profile information. I included a log of the session that
generated it at the top of the gprof output. If there is any other info I
can help you with, please let me know.

A four-second test isn't long enough to gather any statistically
meaningful profile info. On most machines, gprof samples 100 times per
second, so realistically you need a minute or two of runtime to have
trustworthy numbers.

Please replicate the rows in the table by a factor of ten or twenty or
so and try again.

regards, tom lane

OK, here goes again.
I have tried a different table, this one with 27444 rows.
In this case, the query with the regex of the form "row ~* 'regex'"
runs in 1.113 seconds and the other runs in 600.
The session idled for a while after the query completed if that makes a
difference.
Queries and explain output are included at the top of the gprof output.
http://arch.wavefire.com/pgregexgmon.txt
-Wade Klaver

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: wade (#8)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Sigh. It seems that somebody broke caching of compiled regexes,
so that your regex is recompiled each time it's used. I haven't
dug into the logic yet, but I think it must have been a mistake
in Thomas' change to make the regex cache be searched circularly:

2002-06-14 22:49 thomas

* src/backend/utils/adt/regexp.c: Search the existing regular
expression cache as a ring buffer. Will optimize the case for
repeated calls for the same expression, which seems to be the most
common case. Formerly, always searched from the first entry. May
want to look at the least-recently-used algorithm to make sure it
is identifying the right slots to reclaim. Seems silly to do math
when it seems that we could simply use an incrementing counter...

Considering that we now know that this is a factor-of-150 performance
hit, I wonder if this is a "must fix" for 7.3.2? We already wrapped
the tarball, but ...

regards, tom lane

#10wade
wade@wavefire.com
In reply to: Tom Lane (#9)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Well, IMHO I would rather see a delay of the roll-out by a day or two
than see a release with such a serious performance glitch. Especially
since I personally have been shooting my big mouth off to all my geek
friends on the leaps and bounds PG has made in the last few releases. With
my luck one of them will find it. :)
I guess in the end, it comes down to the rest of you developer types, but
I would be inclined to re-wrap. However, this is easy for me to say given
that I have no idea how much work it actually is to re-wrap.
-Wade Klaver

At 08:07 PM 2/3/03 -0500, Tom Lane wrote:

Show quoted text

Sigh. It seems that somebody broke caching of compiled regexes,
so that your regex is recompiled each time it's used. I haven't
dug into the logic yet, but I think it must have been a mistake
in Thomas' change to make the regex cache be searched circularly:

2002-06-14 22:49 thomas

* src/backend/utils/adt/regexp.c: Search the existing regular
expression cache as a ring buffer. Will optimize the case for
repeated calls for the same expression, which seems to be the most
common case. Formerly, always searched from the first entry. May
want to look at the least-recently-used algorithm to make sure it
is identifying the right slots to reclaim. Seems silly to do math
when it seems that we could simply use an incrementing counter...

Considering that we now know that this is a factor-of-150 performance
hit, I wonder if this is a "must fix" for 7.3.2? We already wrapped
the tarball, but ...

regards, tom lane

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: wade (#10)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Wade, how many distinct patterns do you have in that table? What's the
population distribution (in particular, do the top 32 patterns account
for most of the table)?

It's looking like the issue is not so much that the 7.3 code is
completely broken, as that its LRU replacement policy for precompiled
regex patterns got rejiggered in a way that doesn't match your data.

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: wade (#8)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Next question: may I guess that you weren't using MULTIBYTE in 7.2?

After still more digging, I'm coming round to the opinion that the
problem is that MULTIBYTE is forced on in 7.3, and this imposes a
factor-of-256 overhead in a bunch of the operations in regcomp.c.
In particular, compiling a case-independent regex is now hugely
more expensive than it used to be.

The parties who wanted to force MULTIBYTE on promised that there
would be no such penalties :-(

regards, tom lane

#13wade
wade@wavefire.com
In reply to: Tom Lane (#12)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

OK,
I redid my trials with the same data set on 7.2.3 --with-multibyte and I
get the same brutal performance hit, so it is definitely a
multibyte-specific problem.
WRT the distribution of the data in the table, I used the following:
All g-words in /usr/share/dict with different processes attached:
no process
init caps.
word || row_id
etc...

There are only about 1000 words that appear more than once (2 or 3 times)
in 27k rows.
-Wade Klaver

At 11:08 PM 2/3/03 -0500, Tom Lane wrote:

Show quoted text

Next question: may I guess that you weren't using MULTIBYTE in 7.2?

After still more digging, I'm coming round to the opinion that the
problem is that MULTIBYTE is forced on in 7.3, and this imposes a
factor-of-256 overhead in a bunch of the operations in regcomp.c.
In particular, compiling a case-independent regex is now hugely
more expensive than it used to be.

The parties who wanted to force MULTIBYTE on promised that there
would be no such penalties :-(

regards, tom lane

#14Neil Conway
neilc@samurai.com
In reply to: wade (#13)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

On Tue, 2003-02-04 at 11:24, wade wrote:

I redid my trials with the same data set on 7.2.3 --with-multibyte and I
get the same brutal performance hit, so it is definitely a
multibyte-specific problem.

Given that this problem isn't a regression, I don't think we need to
delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential,
IMHO).

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: wade (#13)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

wade <wade@wavefire.com> writes:

I redid my trials with the same data set on 7.2.3 --with-multibyte and I
get the same brutal performance hit, so it is definitely a
multibyte-specific problem.

There are only about 1000 words that appear more than once (2 or 3 times)
in 27k rows.

Right, so the caching of compiled regexps that regexp.c does is of no
help, and any change in its behavior in 7.3 wouldn't have made any
difference anyway. I leapt to a conclusion after reviewing the CVS
logs for pertinent changes, but it was the wrong conclusion. The true
problem is that MULTIBYTE is now forced on, and that causes some
loops in the regexp compiler to change from 256 to 65536 iterations.

I believe if you change NC in src/include/regex/utils.h from its new
value of 65536 back to 256, performance will go back where it was.
This will *not* do if you run any multibyte character sets --- but
as long as the database is all ASCII or ISO-8859-whatever, it should
be a safe hack that will let you use 7.3.*.

Rather than trying to band-aid a solution like this in the main sources,
I think I shall go investigate Spencer's new regexp code in Tcl, which
reputedly is designed for wider-than-8-bit chars from the get-go. We've
had it on the TODO list for a long time to assimilate that code; it's
probably time to make it happen.

regards, tom lane

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#14)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Neil Conway <neilc@samurai.com> writes:

Given that this problem isn't a regression, I don't think we need to
delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential,
IMHO).

No, I've had to abandon my original thought that it was a localized bug,
so it's not going to be fixed in 7.3.2.

The real problem is simply that we're up against design limitations of
the existing regex package, which was never designed for wider-than-8-bit
character sets. It's been rather crudely hacked while it was in our
hands (Henry Spencer would probably disown the code if he saw it now ;-))
so that it sorta kinda does MULTIBYTE, but it's slow and I don't think
it's complete either.

I'm about to go off and look at whether we can absorb the Tcl regex
package, which is Spencer's new baby. That will not be a solution for
7.3.anything, but it could be an answer for 7.4.

regards, tom lane

#17Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#16)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

On Tue, 2003-02-04 at 11:59, Tom Lane wrote:

I'm about to go off and look at whether we can absorb the Tcl regex
package, which is Spencer's new baby. That will not be a solution for
7.3.anything, but it could be an answer for 7.4.

Sounds like we had about the same idea at about the same time -- I
emailed Henry Spencer inquiring about the new RE engine last night. I
came across a post this post that indicates he was planning to package
the new RE engine separately:

http://infosoc.uni-koeln.de/pipermail/php/1999-February/000019.html

but I wasn't able to find a release of it anywhere -- I'll let the list
know if/when he gets back to me.

Another option is to consider a different regular expression engine. At
least according to the benchmarks here,

http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm

Spencer's implementation is outperformed by some other RE engines,
notably PCRE (www.pcre.org). But switching to another engine might
impose backward-compatibility problems, in terms of the details of the
RE syntax.

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#17)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Neil Conway <neilc@samurai.com> writes:

Sounds like we had about the same idea at about the same time -- I
emailed Henry Spencer inquiring about the new RE engine last night.

I just did that this morning ;-) ... but more as politeness than
anything else. AFAICT from searching the net, packaging his new code
as a separate library is something that's been on Spencer's TODO list
for several years now. We've been waiting for him to do it, but I'm now
thinking that it's time to quit waiting. We can lift the code from Tcl
with probably not all that much more work than if it were an official
separate package.

Another option is to consider a different regular expression engine. At
least according to the benchmarks here,
http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm
Spencer's implementation is outperformed by some other RE engines,
notably PCRE (www.pcre.org).

AFAICT, that page is benchmarking Spencer's old code (the same library
we started from). His new code is state-of-the-art according to Friedl
in _Mastering Regular Expressions_, 2nd ed 2002 (O'Reilly).

regards, tom lane

#19Jon Jensen
jon@endpoint.com
In reply to: Neil Conway (#17)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

On Tue, 4 Feb 2003, Neil Conway wrote:

Spencer's implementation is outperformed by some other RE engines,
notably PCRE (www.pcre.org). But switching to another engine might
impose backward-compatibility problems, in terms of the details of the
RE syntax.

It would be a delight to be able to use more advanced (IMHO) Perl-
compatible regexes in PostgreSQL. Perhaps a global configuration variable
to select which regex engine to use would help work around any backward
compatibility problems? Or, somewhat uglier, a different syntax in SQL for
the new regex engine?

Jon

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jon Jensen (#19)
Re: POSIX regex performance bug in 7.3 Vs. 7.2

Jon Jensen <jon@endpoint.com> writes:

It would be a delight to be able to use more advanced (IMHO) Perl-
compatible regexes in PostgreSQL.

After some further research, pcre does seem like an interesting
alternative. Both pcre and Spencer's new code have essentially
Berkeley-style licenses, so there's no problem there. Some
relevant comparisons:

1. pcre tries to be exactly compatible with Perl, so details of its
regex flavor will be familiar to many more people than the Tcl flavor
(by and large the features are similar, but there are differences).

2. pcre is already distributed as a nice tidy library; we need not
extract code from the Tcl distribution.

3. pcre is actively maintained (although tracking a new release every
couple months may not be something we really want to do, anyway).
AFAICT Henry's not doing anything much with his code, so it'd be
pretty much take-once-and-maintain-for-ourselves.

4. pcre looks like it's probably *not* as well suited to a multibyte
environment. In particular, I doubt that its UTF8 compile option was
even turned on for the performance comparison Neil cited --- and the man
page only promises "experimental, incomplete support for UTF-8 encoded
strings". The Tcl code by contrast is used only in a multibyte
environment, so that's the supported, optimized path. It doesn't even
assume null-terminated strings (yay).

5. As best I can tell so far, neither code is currently set up for
run-time choice of encoding; we'd have to do some work for that in
either case. (This probably means that tracking pcre update releases
would be problematic anyhow.)

6. According to Friedl's book, the Tcl engine (Spencer's new code)
is way faster than Perl's, and so presumably faster than pcre, though
I can't find any specific measurements of pcre in the book. It uses a
hybrid DFA/NFA approach which Friedl considers state of the art.

Strict Perl compatibility would be a nice feature, but right at the
moment the multibyte issue is looking like the determining factor.
If we don't get a multibyte-optimized engine out of this change, we're
wasting our time.

regards, tom lane

#21Sean Chittenden
sean@chittenden.org
In reply to: Tom Lane (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: wade (#13)
#23Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#16)
#24Hannu Krosing
hannu@tm.ee
In reply to: Neil Conway (#17)
#25Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#20)
#26Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#20)
#27Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#22)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#25)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#27)
#30Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#30)
#32Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#28)
#33Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#31)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#33)
#35Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#32)
#37wade
wade@wavefire.com
In reply to: Tom Lane (#36)
#38Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#36)
#39Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#38)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#39)