like and optimization

Started by Juriy Goloveshkinalmost 25 years ago13 messages
#1Juriy Goloveshkin
juriy@avias.com
1 attachment(s)

Hello, I didn't know pgsql-sources close,
so I wrote this code just as example of idea.
Can somebody review and make patch for pgsql?
(if this idea is good, of cource).

like-optimization is working only with ASCII, but it is simple to fix.
This programm makes greater string(I tested with KOI8-R):

--
Bye
Juriy Goloveshkin

Attachments:

asd.ctext/plain; charset=us-asciiDownload
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Juriy Goloveshkin (#1)
Re: like and optimization

Juriy Goloveshkin <juriy@avias.com> writes:

Hello, I didn't know pgsql-sources close,
so I wrote this code just as example of idea.
Can somebody review and make patch for pgsql?

AFAICT this only deals with the issue of single-byte characters that
sort in an order different from their numeric order. The existing
make_greater_string() code already deals with that case. Where it
falls down is cases where sorting is context-dependent (multi-pass
sort rules, digraphs, things like that). But I don't see anything
here that would make such cases work.

If you're trying to tell us that the 7.0.* code works correctly for
KOI8-R locale, we'd be glad to re-enable LIKE optimization for that
locale ...

regards, tom lane

#3Noname
robn@verdi.et.tudelft.nl
In reply to: Tom Lane (#2)
Re: like and optimization

On Sun, 21 Jan 2001 00:25:17 +0000 (UTC), Tom Lane <tgl@sss.pgh.pa.us> wrote:

Juriy Goloveshkin <juriy@avias.com> writes:

Hello, I didn't know pgsql-sources close,
so I wrote this code just as example of idea.
Can somebody review and make patch for pgsql?

AFAICT this only deals with the issue of single-byte characters that
sort in an order different from their numeric order. The existing
make_greater_string() code already deals with that case. Where it
falls down is cases where sorting is context-dependent (multi-pass
sort rules, digraphs, things like that). But I don't see anything
here that would make such cases work.

If you're trying to tell us that the 7.0.* code works correctly for
KOI8-R locale, we'd be glad to re-enable LIKE optimization for that
locale ...

Hello,

I have no knowledge of postgres internals at all (yet !), and I'm not
quite sure what this thread is exactly about.

But if anybody thinks that selects with LIKE on indexed columns with
single-byte non-ASCII characters are working OK: they are not !! See my
posting and following thread "7.0.3 reproduceable serious select error"
from a couple of days ago.

I made a reproduceable example of things going wrong with a "en_US"
locale which is the widely-used (single-byte) ISO-8859-1 Latin 1 charset.

Please excuse me if this has nothing to do with what you are talking
about. I'm just very eager to get rid of this (for our application)
extremely nasty bug !

friendly greetings,
Rob van Nieuwkerk

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#3)
Re: like and optimization

robn@verdi.et.tudelft.nl (Rob van Nieuwkerk) writes:

But if anybody thinks that selects with LIKE on indexed columns with
single-byte non-ASCII characters are working OK: they are not !! See my
posting and following thread "7.0.3 reproduceable serious select error"
from a couple of days ago.

Yes, we know :-(. That's why that optimization is currently disabled
for non-ASCII locales in 7.1. Juriy appears to be saying that it does
work OK in KOI8-R locale.

I made a reproduceable example of things going wrong with a "en_US"
locale which is the widely-used (single-byte) ISO-8859-1 Latin 1 charset.

en_US uses multi-pass collation rules. It's those collation rules, not
the charset per se, that causes the problem.

regards, tom lane

#5Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#4)
AW: like and optimization

I made a reproduceable example of things going wrong with a "en_US"
locale which is the widely-used (single-byte) ISO-8859-1 Latin 1 charset.

en_US uses multi-pass collation rules. It's those collation rules, not
the charset per se, that causes the problem.

Just to understand things correctly. Is the Like optimization disabled
for all non-ASCII char sets, or (imho correctly) for non charset ordered
collations (LC_COLLATE) ?

Thus can you enable index optimization by simply setting
LC_COLLATE to C if your LANG is not set to C ?

Andreas

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#5)
Re: AW: like and optimization

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Just to understand things correctly. Is the Like optimization disabled
for all non-ASCII char sets, or (imho correctly) for non charset ordered
collations (LC_COLLATE) ?

Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
We can add other names to the "OK" list as we verify that they are safe
(see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).

regards, tom lane

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#6)
Re: AW: like and optimization

Tom Lane writes:

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Just to understand things correctly. Is the Like optimization disabled
for all non-ASCII char sets, or (imho correctly) for non charset ordered
collations (LC_COLLATE) ?

Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
We can add other names to the "OK" list as we verify that they are safe
(see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).

I have pretty severe doubts that any locale for a language that uses the
Latin, Cyrillic, or Greek alphabets (i.e., those that are conceptually
similar to English) is like-optimization safe (for the optimization
algorithm in its current state), at least across all platforms.
Somewhere a vendor is going to adhere to some ISO standard and implement
the same multi-pass "letters first" rules that we observed in en_US.

There should be some extensive "stress test" that a locale should have to
pass before being labelled safe.

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/

#8Hannu Krosing
hannu@tm.ee
In reply to: Peter Eisentraut (#7)
Re: AW: like and optimization

Peter Eisentraut wrote:

Tom Lane writes:

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Just to understand things correctly. Is the Like optimization disabled
for all non-ASCII char sets, or (imho correctly) for non charset ordered
collations (LC_COLLATE) ?

Currently it's disabled whenever LC_COLLATE is neither C nor POSIX.
We can add other names to the "OK" list as we verify that they are safe
(see locale_is_like_safe() in src/backend/utils/adt/selfuncs.c).

I have pretty severe doubts that any locale for a language that uses the
Latin, Cyrillic, or Greek alphabets (i.e., those that are conceptually
similar to English) is like-optimization safe (for the optimization
algorithm in its current state), at least across all platforms.
Somewhere a vendor is going to adhere to some ISO standard and implement
the same multi-pass "letters first" rules that we observed in en_US.

Is there any possibility to use, in a portable way, only our own locale
definition files, without reimplementing all the sorts uppercases etc. ?

If we had control over the locale definition contents we would be much
better
off when optimizing as well.

And IIRC SQL9x prescribe support for multiple locales (or at least
multiple
collating sequences) within one database simultaneously.

There should be some extensive "stress test" that a locale should have to
pass before being labelled safe.

Sure.

-------------
Hannu

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#8)
Re: AW: like and optimization

Hannu Krosing <hannu@tm.ee> writes:

Is there any possibility to use, in a portable way, only our own locale
definition files, without reimplementing all the sorts uppercases etc. ?

AFAIK there is not --- the standard C library APIs do not specify how to
represent this information. Thus, we'd have to provide our own complete
implementation of locale-specific comparisons, etc, etc. Not to mention
acquiring all the raw data for the locale definitions.

I think we'd be nuts to try to develop and maintain our own
implementation of that. What we should probably think about is somehow
piggybacking on someone else's i18n library work, with just enough
tweaking of the source so that it can cope efficiently with N different
locales at runtime, instead of only one.

The situation is not too much different for timezones, BTW. Might make
sense to deal with both of those problems in the same way.

Are there any BSD-license locale and/or timezone libraries that we might
assimilate in this way? We could use an LGPL'd library if there is no
other alternative, but I'd just as soon not open up the license issue.

regards, tom lane

#10Noname
ncm@zembu.com
In reply to: Tom Lane (#9)
Re: AW: like and optimization

On Mon, Jan 22, 2001 at 05:46:09PM -0500, Tom Lane wrote:

Hannu Krosing <hannu@tm.ee> writes:

Is there any possibility to use, in a portable way, only our own locale
definition files, without reimplementing all the sorts uppercases etc. ?

The situation is not too much different for timezones, BTW. Might make
sense to deal with both of those problems in the same way.

The timezone situation is much better, in that there is a separate
organization which maintains a timezone database and code to operate
on it. It wouldn't be necessary to include the package with PG,
because it can be got at a standard place. You would only need
scripts to download, build, and integrate it.

Are there any BSD-license locale and/or timezone libraries that we might
assimilate in this way? We could use an LGPL'd library if there is no
other alternative, but I'd just as soon not open up the license issue.

Posix systems include a set of commands for dumping locales in a standard
format, and building from them. Instead of shipping locales and code to
operate on them, one might include a script to run these tools (where
they exist) to dump an existing locale, edit it a bit, and build a more
PG-friendly locale.

Nathan Myers
ncm@zembu.com

#11Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Hannu Krosing (#8)
Re: AW: like and optimization

And IIRC SQL9x prescribe support for multiple locales (or at least
multiple
collating sequences) within one database simultaneously.

Sounds like SQL92/99 COLLATE things is the way we should go, IMHO.
--
Tatsuo Ishii

#12Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Tom Lane (#9)
Re: AW: like and optimization

On Mon, Jan 22, 2001 at 05:46:09PM -0500, Tom Lane wrote:
...

Are there any BSD-license locale and/or timezone libraries that we might
assimilate in this way? We could use an LGPL'd library if there is no
other alternative, but I'd just as soon not open up the license issue.

The "Citrus Project" is coming up with with i18n for BSD.

FYI

Patrick

#13Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Noname (#10)
Re: AW: like and optimization

On Mon, Jan 22, 2001 at 03:09:03PM -0800, Nathan Myers wrote:
...

Posix systems include a set of commands for dumping locales in a standard
format, and building from them. Instead of shipping locales and code to
operate on them, one might include a script to run these tools (where
they exist) to dump an existing locale, edit it a bit, and build a more
PG-friendly locale.

Is there really a standard format for locales? Apparantly there are 3 different
ways of doing LC_COLLATE ?!

Cheers,

Patrick