Better locale-specific-character-class handling for regexps

Started by Tom Lane over 9 years ago. 8 messages (pgsql-hackers).
#1 Tom Lane
tgl@sss.pgh.pa.us

I got tired of hearing complaints about the issue described in
this thread:
/messages/by-id/24241.1329347196@sss.pgh.pa.us

Here's a proposed fix. I've not done extensive performance testing,
but it seems to be as fast or faster than the old code in cases where
there are not too many "large" characters in the input. And, more
to the point, it gets the right answer for such large characters.

I'll add this to the upcoming commitfest.

regards, tom lane

Attachments:

better-regex-colormaps-1.patch (text/x-diff, charset=us-ascii) +1239 −1227
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#1)
Re: Better locale-specific-character-class handling for regexps

On 08/23/2016 03:54 AM, Tom Lane wrote:

! That's still not quite enough, though, because of locale-dependent
! character classes such as [[:alpha:]]. In Unicode locales these classes
! may have thousands of entries that are above MAX_SIMPLE_CHR, and we
! certainly don't want to be searching large colormaprange arrays at runtime.
! Nor do we even want to spend the time to initialize cvec structures that
! exhaustively describe all of those characters. Our solution is to compute
! exact per-character colors at compile time only up to MAX_SIMPLE_CHR.
! For characters above that, we apply the <ctype.h> or <wctype.h> lookup
! functions at runtime for each locale-dependent character class used in the
! regex pattern, constructing a bitmap that describes which classes the
! runtime character belongs to. The per-character-range data structure
! mentioned above actually holds, for each range, a separate color entry
! for each possible combination of character class properties. That is,
! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
! whose rows correspond to character ranges that are explicitly mentioned
! in the input, and whose columns correspond to sets of relevant locale
! character classes.

I think that last sentence should say "whose rows correspond to
character ranges that are explicitly mentioned in the *regex pattern*",
rather than "in the input".

An example would be very helpful here. I had to read through this many
times, until I understood it. I can easily come up with examples for
character classes, but not for "high" character-ranges. The best I could
come up with is to check if a character belongs to some special group
of Unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical
symbol characters. In practice, I guess you will only see single
characters in the colormaprange array, although we must of course cope
with ranges too.

+       /* this relies on WHITE being zero: */
+       memset(cm->locolormap, WHITE,
+                  (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
+
+       memset(cm->classbits, 0, sizeof(cm->classbits));
+       cm->numcmranges = 0;
+       cm->cmranges = NULL;
+       cm->maxarrayrows = 4;           /* arbitrary initial allocation */
+       cm->hiarrayrows = 1;
+       cm->hiarraycols = 1;
+       cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color));
+       if (cm->hicolormap == NULL)
+       {
+               CERR(REG_ESPACE);
+               return;
+       }
+       /* initialize the "all other characters" row to WHITE */
+       cm->hicolormap[0] = WHITE;

Is the comment correct? I don't see why this wouldn't work with "WHITE
!= 0".

! /* Duplicate existing columns to the right, and increase ref counts */
! /* Must work downwards in the array because we realloc'd in place */
! for (r = cm->hiarrayrows - 1; r >= 0; r--)
! {
! color *oldrowptr = &newarray[r * cm->hiarraycols];
! color *newrowptr = &newarray[r * cm->hiarraycols * 2];
! color *newrowptr2 = newrowptr + cm->hiarraycols;

! for (c = 0; c < cm->hiarraycols; c++)
! {
! color co = oldrowptr[c];
!
! newrowptr[c] = newrowptr2[c] = co;
! cm->cd[co].nuchrs++;
! }
! }

Perhaps "backwards" would be clearer than "downwards"? At least in my
mental model, index 0 is the top row of an array, so "downwards" means
0, 1, 2. I guess you meant downwards numerically, rather than visually,
but it took me a moment to process that.

+1 for this patch in general. Some regression test cases would be nice.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#2)
Re: Better locale-specific-character-class handling for regexps

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 08/23/2016 03:54 AM, Tom Lane wrote:

! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
! whose rows correspond to character ranges that are explicitly mentioned
! in the input, and whose columns correspond to sets of relevant locale
! character classes.

I think that last sentence should say "whose rows correspond to
character ranges that are explicitly mentioned in the *regex pattern*",
rather than "in the input".

OK. The pattern is the input under consideration by regcomp.c, but I see
your point that someone might be confused.

An example would be very helpful here.

I'll see what I can do.

I can easily come up with examples for
character classes, but not for "high" character-ranges. The best I could
come up with is to check if a character belongs to some special group
of Unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical
symbol characters.

Right, or someone might write the equivalent of [A-Z] in the Klingon
alphabet, or whatever --- it wouldn't necessarily have to use backslash
notation, it could be literal characters with large codes.

In practice, I guess you will only see single
characters in the colormaprange array, although we must of course cope
with ranges too.

I suspect that's true; I'm not sure that there are many places in the
upper reaches of Unicode space where ranges of consecutive character
codes form natural units to search for.

+       /* this relies on WHITE being zero: */
+       memset(cm->locolormap, WHITE,
+                  (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));

Is the comment correct? I don't see why this wouldn't work with "WHITE
!= 0".

The array entries are shorts not chars, so if WHITE were say 2, memset
would fill the array with 0x0202 rather than 2. I suppose if WHITE were
say 0x0303 then it would accidentally work, but I don't think the comment
needs to get into that.

Perhaps "backwards" would be clearer than "downwards"?

OK.

+1 for this patch in general. Some regression test cases would be nice.

I'm not sure how to write such tests without introducing insurmountable
platform dependencies --- particularly on platforms with weak support for
UTF8 locales, such as OS X. All the interesting cases require knowing
what iswalpha() etc will return for some high character codes.

What I did to test it during development was to set MAX_SIMPLE_CHR to
something in the ASCII range, so that the high-character-code paths could
be tested without making any assumptions about locale classifications for
non-ASCII characters. I'm not sure that's a helpful idea for regression
testing purposes, though.

I guess I could follow the lead of collate.linux.utf8.sql and produce
a test that's only promised to pass on one platform with one encoding,
but I'm not terribly excited by that. AFAIK that test file does not
get run at all in the buildfarm or in the wild.

Thanks for reviewing!

regards, tom lane


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Better locale-specific-character-class handling for regexps

I wrote:

I got tired of hearing complaints about the issue described in
this thread:
/messages/by-id/24241.1329347196@sss.pgh.pa.us
Here's a proposed fix. I've not done extensive performance testing,
but it seems to be as fast or faster than the old code in cases where
there are not too many "large" characters in the input. And, more
to the point, it gets the right answer for such large characters.

I've now done some performance testing on this patch. I ran the
attached scripts in cassert-off builds, using en_US.utf8 locale on
a RHEL6 box. Each number is the best of 3 runs; all times in ms.

The first two test cases are meant to exercise regex compile speed; they
provide relatively short input strings, and each call provides a slightly
different pattern so as to defeat the compiled-pattern cache in
adt/regexp.c.

                     HEAD        patched

compilespeed.sql     61959.793   32936.773
compilespeed09.sql   6411.957    5903.211

compilespeed.sql exercises compiling '\w' which of course is locale
dependent, while compilespeed09.sql compiles '[0-9]' which is not.
Since the patch sets MAX_SIMPLE_CHR to 0x7FF which is the same as
regc_pg_locale.c's old cutoff for how many characters to examine,
we must be doing about the same number of compile-time iswalnum()
probes in both old and new code. The compile speedup must come entirely
from the fact that the new code uses a simple array for the low-char-codes
colormap, whereas the old code has a complicated multilevel colormap that
is more expensive to update. Then the bigger win for compilespeed.sql
must be due to the fact that \w will require changing the colors of more
characters in the colormap. I was not expecting to see such a big win,
but I'll take it ;-)

The remaining cases hold the pattern fixed (so it's not recompiled
each time) and are meant to measure the per-character scan speed
while executing a compiled regexp.

                     HEAD         patched     ratio of per-char times

runspeedlo.sql       12951.588    12053.533   0.91
runspeedhi.sql       10410.140 *  36678.876   2.72 (estimated)
runspeednotlo.sql    12999.629    12130.439   0.91
runspeednothi.sql    15678.125    20053.861   1.36
dummy.sql            3444.696     3437.984    (to measure overhead)

* gives wrong answer

In the runspeedhi.sql case, HEAD thinks the first character doesn't match
the pattern, so it's not scanning the whole input, which makes that
measurement not useful. We can assume though that the scan rate would
have been about the same as runspeednothi.sql, and the estimated per-char
ratio shown in the table is computed on that basis.

So the patched code's scan speed is about 9% better than HEAD for data
characters below MAX_SIMPLE_CHR, presumably because of the simpler color
map structure. But it's noticeably worse for data chars above that, and
much worse if the pattern forces runtime iswalnum() tests. That's what
I was expecting.

It's worth noticing that comparing runspeednotlo.sql and runspeednothi.sql
shows that HEAD is 28% worse per-character on the latter. Those cases are
exactly the same so far as HEAD's regex engine is concerned, so I think
that that differential must be blamable entirely on the fact that we're
converting 2-byte UTF8 sequences into pg_wchars in the first case and
3-byte UTF8 sequences in the second case. The patched code is also paying
that cost of course, so you should take it into account when comparing
numbers in different rows of this table. And it's also an indication
that really, this code is pretty frickin fast --- the total time per
character is only circa 4x the extra logic to pick up and shift in
one more byte!

Overall I'm pretty happy with these results. I think that most cases
will be as fast or faster than before, assuming that most data characters
are below MAX_SIMPLE_CHR for most use-cases. In the worst cases, the
per-character cost can be several times what it was before, but those
are exactly the same cases where *we gave the wrong answer* before, so
it's hard to complain too much.

regards, tom lane

Attachments:

compilespeed.sql (text/plain)
compilespeed09.sql (text/plain)
runspeedlo.sql (text/plain)
runspeedhi.sql (text/plain)
runspeednotlo.sql (text/plain)
runspeednothi.sql (text/plain)
dummy.sql (text/plain)
#5 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: Better locale-specific-character-class handling for regexps

On 09/04/2016 08:44 PM, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 08/23/2016 03:54 AM, Tom Lane wrote:
+1 for this patch in general. Some regression test cases would be nice.

I'm not sure how to write such tests without introducing insurmountable
platform dependencies --- particularly on platforms with weak support for
UTF8 locales, such as OS X. All the interesting cases require knowing
what iswalpha() etc will return for some high character codes.

What I did to test it during development was to set MAX_SIMPLE_CHR to
something in the ASCII range, so that the high-character-code paths could
be tested without making any assumptions about locale classifications for
non-ASCII characters. I'm not sure that's a helpful idea for regression
testing purposes, though.

I guess I could follow the lead of collate.linux.utf8.sql and produce
a test that's only promised to pass on one platform with one encoding,
but I'm not terribly excited by that. AFAIK that test file does not
get run at all in the buildfarm or in the wild.

I'm not too worried if the tests don't get run regularly, but I don't
like the idea of a test that only works on one platform. This code is
unlikely to be accidentally broken by unrelated changes in the backend,
as the regexp code is very well isolated. But for someone who hacks on
the regexp library in the future, having a test suite to tickle all
these corner cases would be very handy.

Another class of regressions would be that something changes in the way
a locale treats some characters, and that breaks an application. That
would be very difficult to test for in a platform-independent way. That
wouldn't really be our bug, though, but the locale's.

Since we're now de facto maintainers of this regexp library, and our
version could be used somewhere else than PostgreSQL too, it would
actually be nice to have a regression suite that's independent from the
pg_regress infrastructure, and wouldn't need a server to run. Perhaps a
stand-alone C program that compiles the regexp code with mock versions
of pg_wc_is* functions. Or perhaps a magic collation OID that makes
the pg_wc_is* functions return hard-coded values for particular inputs.

- Heikki


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: Better locale-specific-character-class handling for regexps

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 09/04/2016 08:44 PM, Tom Lane wrote:

I guess I could follow the lead of collate.linux.utf8.sql and produce
a test that's only promised to pass on one platform with one encoding,
but I'm not terribly excited by that. AFAIK that test file does not
get run at all in the buildfarm or in the wild.

I'm not too worried if the tests don't get run regularly, but I don't
like the idea of a test that only works on one platform.

Well, it would work on any platform that reports high Unicode letters
as letters. The problem for putting this into the regular regression
tests is that the generic tests don't even assume UTF8 encoding, let
alone a Unicode-ish locale.

Since we're now de facto maintainers of this regexp library, and our
version could be used somewhere else than PostgreSQL too, it would
actually be nice to have a regression suite that's independent from the
pg_regress infrastructure, and wouldn't need a server to run.

If anyone ever really picks up the challenge of making the regexp library
a standalone project, I think one of the first orders of business would be
to pull out the Tcl project's regexp-related regression tests. There's a
pretty extensive set of tests written by Henry Spencer himself, and more
that they added over the years; it's far more comprehensive than our
tests. (I've looked at stealing that test set in toto, but it requires
some debug APIs that we don't expose in SQL, and probably don't want to.)

In any case, this is getting very far afield from the current patch.
I'm willing to add a regexp.linux.utf8.sql test file if you think it's
important to have some canned tests that exercise this new code, but
otherwise I don't see any near-term solution.

regards, tom lane


#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#6)
Re: Better locale-specific-character-class handling for regexps

On 09/05/2016 07:10 PM, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 09/04/2016 08:44 PM, Tom Lane wrote:

I guess I could follow the lead of collate.linux.utf8.sql and produce
a test that's only promised to pass on one platform with one encoding,
but I'm not terribly excited by that. AFAIK that test file does not
get run at all in the buildfarm or in the wild.

I'm not too worried if the tests don't get run regularly, but I don't
like the idea of a test that only works on one platform.

Well, it would work on any platform that reports high Unicode letters
as letters. The problem for putting this into the regular regression
tests is that the generic tests don't even assume UTF8 encoding, let
alone a Unicode-ish locale.

Ah, ok. I thought there were some more special requirements.

Since we're now de facto maintainers of this regexp library, and our
version could be used somewhere else than PostgreSQL too, it would
actually be nice to have a regression suite that's independent from the
pg_regress infrastructure, and wouldn't need a server to run.

If anyone ever really picks up the challenge of making the regexp library
a standalone project, I think one of the first orders of business would be
to pull out the Tcl project's regexp-related regression tests. There's a
pretty extensive set of tests written by Henry Spencer himself, and more
that they added over the years; it's far more comprehensive than our
tests. (I've looked at stealing that test set in toto, but it requires
some debug APIs that we don't expose in SQL, and probably don't want to.)

Oh, interesting.

In any case, this is getting very far afield from the current patch.
I'm willing to add a regexp.linux.utf8.sql test file if you think it's
important to have some canned tests that exercise this new code, but
otherwise I don't see any near-term solution.

Ok, that'll do.

- Heikki


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#7)
Re: Better locale-specific-character-class handling for regexps

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 09/05/2016 07:10 PM, Tom Lane wrote:

In any case, this is getting very far afield from the current patch.
I'm willing to add a regexp.linux.utf8.sql test file if you think it's
important to have some canned tests that exercise this new code, but
otherwise I don't see any near-term solution.

Ok, that'll do.

OK, I'll put something together. Thanks again for reviewing.

regards, tom lane
