Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm

Started by Tom Laneabout 13 years ago5 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

While reviewing the latest incarnation of the regex indexing patch,
I noticed that make_trigrams() in contrib/pg_trgm/trgm_op.c is coded
so that if USE_WIDE_UPPER_LOWER is not set, it ignores multibyte
character boundaries and just makes trigrams from 3-byte substrings.
This seems slightly insane, not least because there's an Assert there
that will fail if it's fed any multibyte characters. I suppose no one
has actually tried this code with non-ASCII data on machines where
USE_WIDE_UPPER_LOWER isn't set; at least not with Asserts turned on.
(Considering that even my favorite dinosaur HPUX machine has got both
HAVE_WCSTOMBS and HAVE_TOWLOWER, it may well be that there *aren't* any
such machines anymore.)

So I'm inclined to remove the two #ifdef USE_WIDE_UPPER_LOWER tests
in trgm_op.c, and just use the multibyte-aware code all the time.
A downside of this is that if there is indeed anyone out there storing
non-ASCII trigrams on a machine without USE_WIDE_UPPER_LOWER, their
indexes would break if they pg_upgrade to 9.3. OTOH their indexes would
break anyway if they rebuilt against a more modern libc, or built with
Asserts on.

If we don't do this then we'll have to complicate the regex indexing
patch some more, since it's currently imagining that cnt_trigram()
is always the way to make storable trigrams from raw text, and this
is just wrong in the existing non-USE_WIDE_UPPER_LOWER code path.

Comments?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#1)
Re: Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm

On Sun, Apr 7, 2013 at 7:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

While reviewing the latest incarnation of the regex indexing patch,
I noticed that make_trigrams() in contrib/pg_trgm/trgm_op.c is coded
so that if USE_WIDE_UPPER_LOWER is not set, it ignores multibyte
character boundaries and just makes trigrams from 3-byte substrings.
This seems slightly insane, not least because there's an Assert there
that will fail if it's fed any multibyte characters. I suppose no one
has actually tried this code with non-ASCII data on machines where
USE_WIDE_UPPER_LOWER isn't set; at least not with Asserts turned on.
(Considering that even my favorite dinosaur HPUX machine has got both
HAVE_WCSTOMBS and HAVE_TOWLOWER, it may well be that there *aren't* any
such machines anymore.)

So I'm inclined to remove the two #ifdef USE_WIDE_UPPER_LOWER tests
in trgm_op.c, and just use the multibyte-aware code all the time.
A downside of this is that if there is indeed anyone out there storing
non-ASCII trigrams on a machine without USE_WIDE_UPPER_LOWER, their
indexes would break if they pg_upgrade to 9.3. OTOH their indexes would
break anyway if they rebuilt against a more modern libc, or built with
Asserts on.

If we don't do this then we'll have to complicate the regex indexing
patch some more, since it's currently imagining that cnt_trigram()
is always the way to make storable trigrams from raw text, and this
is just wrong in the existing non-USE_WIDE_UPPER_LOWER code path

+1 for removing #ifdef USE_WIDE_UPPER_LOWER tests. Even if it works
somewhere with non-ASCII data without USE_WIDE_UPPER_LOWER then anyway it's
a buggy logic with invalid results.

It's also likely we can change
if (pg_database_encoding_max_length() > 1)
into something like
if (pg_database_encoding_max_length() > 1 && bytelen != charlen)

------
With best regards,
Alexander Korotkov.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#2)
Re: Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm

Alexander Korotkov <aekorotkov@gmail.com> writes:

It's also likely we can change
if (pg_database_encoding_max_length() > 1)
into something like
if (pg_database_encoding_max_length() > 1 && bytelen != charlen)

Hm, actually couldn't we just simplify to "if (bytelen != charlen)"?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#3)
Re: Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm

On Sun, Apr 7, 2013 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:

It's also likely we can change
if (pg_database_encoding_max_length() > 1)
into something like
if (pg_database_encoding_max_length() > 1 && bytelen != charlen)

Hm, actually couldn't we just simplify to "if (bytelen != charlen)"?

I think yes, we can :)

------
With best regards,
Alexander Korotkov.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#4)
Re: Slightly insane use of USE_WIDE_UPPER_LOWER in pg_trgm

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Sun, Apr 7, 2013 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm, actually couldn't we just simplify to "if (bytelen != charlen)"?

I think yes, we can :)

OK. I pushed this as a separate commit so as to highlight the potential
incompatibility in the commit log. I am not sure it's even worth a
release note though ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers