Postgres 6.5 beta2 and beta3 problem

Started by Daniel Kalchev · 44 messages · hackers
#1 Daniel Kalchev
daniel@digsys.bg

Hello,

Sorry that I picked this up so late in the release cycle, but other work
prevented my earlier involvement in the testing.

There are at least two serious problems that I discovered so far with Postgres
6.5 beta2 and beta3 (running on BSD/OS 4.0.1):

1. LIKE with indexes works worse than without indexes.

Given the following schema:

CREATE TABLE "words" (
"w_key" text,
"w_pages" text);
CREATE INDEX "w_k_words_i" on "words" using btree ( "w_key" "text_ops" );

The table words has 117743 records.

The following query:

select w_key from words where w_key like 'sometext%'

is explained as:

Index Scan using w_k_words_i on words (cost=3335.38 rows=1 width=12)

and runs for several minutes. If I drop the w_k_words_i index, the explain is:

Seq Scan on words (cost=7609.52 rows=1 width=12)

and the query runs noticeably faster.

Under 6.4 the behavior is as expected, much better with indexes.

2. Under Postgres 6.4 the following query:

SELECT config.serviceid, custid, datetime_date( updated_at ) AS date,
       archived AS a, c.subserviceid AS ss, c.usage_price AS price,
       c.usage_included AS time, service
FROM a, b, c
WHERE confid IN ( SELECT confid
                  FROM a
                  WHERE archived_at > '30-04-1999'
                    AND created_at < '30-04-1999' )
  AND NOT archived
  AND a.serviceid = b.serviceid
  AND c.serviceid = a.serviceid
GROUP BY custid, serviceid, subserviceid;

works, although it runs for an indefinitely long time (due to the subselect - but
this is not a problem, as it can be rewritten). Under Postgres 6.5, however, it
is not accepted, because there are no aggregates in the target list. Is this
incorrect behavior in 6.4.2, or does 6.5 have different syntax?

Regards,
Daniel Kalchev

#2 Daniel Kalchev
daniel@digsys.bg
In reply to: Daniel Kalchev (#1)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

Regards,
Daniel


"Hiroshi Inoue" said:

-----Original Message-----
From: owner-pgsql-hackers@postgreSQL.org
[mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Daniel Kalchev
Sent: Wednesday, June 09, 1999 5:04 PM
To: pgsql-hackers@hub.org
Subject: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Hello,

Sorry that I picked this too late in the release cycle, but other work
prevented my earlier involvement in the testing.

There are at least two serious problems that I discovered so far with
Postgres 6.5 beta2 and beta3 (running on BSD/OS 4.0.1):

1. LIKE with indexes works worse than without indexes.

Did you build with --enable-locale?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#3 Bruce Momjian
bruce@momjian.us
In reply to: Daniel Kalchev (#2)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4 Oleg Bartunov
oleg@sai.msu.su
In reply to: Bruce Momjian (#3)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Wed, 9 Jun 1999, Bruce Momjian wrote:

Date: Wed, 9 Jun 1999 08:43:50 -0400 (EDT)
From: Bruce Momjian <maillist@candle.pha.pa.us>
To: Daniel Kalchev <daniel@digsys.bg>
Cc: Hiroshi Inoue <Inoue@tpf.co.jp>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

If so, why does explain report 'Index Scan ...'?
apod=> \d
Database    = apod
 +------------------+----------------------------------+----------+
 |  Owner           |             Relation             |   Type   |
 +------------------+----------------------------------+----------+
 | megera           | idx_adate                        | index    |
 | megera           | idx_atitle                       | index    |
 | megera           | idx_url                          | index    |
 | megera           | titles                           | table    |
 +------------------+----------------------------------+----------+

apod=> explain select atitle from titles where atitle like 'Sun%';
NOTICE: QUERY PLAN:

Index Scan using idx_atitle on titles (cost=33.28 rows=1 width=12)

Current cvs, --enable-locale --with-mb=KOI8

Regards,

Oleg

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.


_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#5 Daniel Kalchev
daniel@digsys.bg
In reply to: Bruce Momjian (#3)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
most of my applications. Perhaps something can be done to resolve this
situation?

I understand the problem for 16-bit character support, but for most of the
encodings that support only 8-bit characters it should be safe to assume the
maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things
here, because in my case the Cyrillic letter order matches that of the binary
encoding (that is, the first letter of the alphabet sorts before the second, etc.).

Perhaps the locale data can be used to gather this information?

I will do some testing without using locale to see what happens.

Regards,
Daniel

Bruce Momjian said:

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.


This certainly explains it. With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.

#6 Daniel Kalchev
daniel@digsys.bg
In reply to: Daniel Kalchev (#5)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Further to my Cyrillic charset (--enable-locale) woes, I would like
to add that after recompiling Postgres 6.5 without --enable-locale, the
query now runs reasonably fast and explain gives:

Index Scan using w_k_words_i on words (cost=1112.13 rows=1 width=12)

instead of

Index Scan using w_k_words_i on words (cost=3335.38 rows=1 width=12)

The problem, as it seems, is that the restrictions on using indexes when locale
is enabled are not consistently applied: explain shows that indexes will be
used, and the behavior with and without indexes differs (with indexes it's
noticeably slower :-), so indexes are apparently being used...

Apparently (for my current tests at least) Bulgarian Cyrillic
(windows-1251) is handled reasonably well without locale support - until now
~* didn't work anyway.

Daniel Kalchev

#7 Oleg Bartunov
oleg@sai.msu.su
In reply to: Daniel Kalchev (#5)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Wed, 9 Jun 1999, Daniel Kalchev wrote:

Date: Wed, 09 Jun 1999 16:15:58 +0300
From: Daniel Kalchev <daniel@digsys.bg>
To: Bruce Momjian <maillist@candle.pha.pa.us>
Cc: pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
most of my applications. Perhaps something can be done to resolve this
situation?

I understand the problem for 16-bit characters support, but for most of the
encodings that support only 8 bit characters it should be safe to assume the
maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things
here, because in my case the cyrillic letters order match that of the binary
encoding (that is, the first alphabet letter is before the second etc).

Perhaps the locale data can be used to gather this information?

It's certainly there! Locale data contains all the information about a
specific character set and encoding. Is it possible to use it to create
indexes? It would be slower, but for non-US people the benefits of using
indexes would cover the expense. I didn't notice such behaviour in Informix
and Oracle. Fixing this would be a good point for the popularity of
Postgres. Is there any chance to place it in the TODO for 6.6?
At least explain should reflect this fact!

Regards,

Oleg

I will do some testing without using locale to see what happens.

Regards,
Daniel

Bruce Momjian said:

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.



#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oleg Bartunov (#7)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Daniel Kalchev <daniel@digsys.bg> writes:

1. LIKE with indexes works worse than without indexes.

Since you are using USE_LOCALE, the parser is inserting only a one-sided
index restriction; that is
WHERE w_key like 'sometext%'
becomes
WHERE w_key like 'sometext%' AND w_key >= 'sometext'
whereas without USE_LOCALE it becomes
WHERE w_key like 'sometext%' AND w_key >= 'sometext'
AND w_key <= 'sometext\377'
6.4 always did the latter, which was wrong in non-ASCII locales because
\377 might not be the highest character in the sort order. (Strictly
speaking it's wrong in ASCII locale as well...)

Of course, the one-sided index restriction is much less selective than
the two-sided; depending on what 'sometext' actually is, you might end
up scanning most of the table, and since index scan is much slower
per-tuple-scanned than sequential scan, you lose. That's evidently
what's happening here.

I suspect that the optimizer's cost estimates need refinement;
it should be able to guess that the sequential scan will be the
faster choice here.

Of course what you really want is a two-sided index restriction,
but we are not going to be able to fix that until someone figures
out a locale-independent way of generating a "slightly larger"
comparison string. So far I have not heard any proposals that
sound like they will work...

Under Postgres 6.5, however, it
is not accepted, because there are no aggregates in the target list.

No, that's not what it's unhappy about; it's unhappy because there
are ungrouped fields used in the target list. This is erroneous
SQL because there's no unique choice of value to return for such a
field (if several tuples are grouped together, which one's value
of the field do you use?). Prior versions of Postgres failed to detect
this error, but it's an error. You were getting randomly selected
values for the ungrouped fields, I suppose.

regards, tom lane

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce Momjian <maillist@candle.pha.pa.us> writes:

This certainly explains it. With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:
fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison. In ASCII locale this means
fld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.

The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence. I had
been thinking about a brute force method: generate a string and
check to see whether strcmp claims it's greater than the original
or not; if not, increment the last byte and try again. You'd
also have to be able to back up and increment earlier bytes if
you maxed out the last one. For example, in French locale,
fld LIKE 'ab\376%'
you'd first produce 'ab\377' but discover that it's less than
'ab\376' (because \377 is y-dieresis which sorts like 'y').
Your next try must be 'ac\377' which will succeed.

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp. So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.

regards, tom lane

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Daniel Kalchev <daniel@digsys.bg> writes:

The problem as it seems is that the restrictions to use indexes when
locale is enabled are not consistently applied - explain shows that
indices will be used, and the behavior with indexes and without
indexes is different (with indexes it's noticeably slower :-) so
indexes are apparently being used...

Right, but what EXPLAIN doesn't show you (unless you can read the
much uglier EXPLAIN VERBOSE output) is what index restrictions are
being used.

LIKE doesn't know anything about indexes, nor vice versa. What the
index-scan machinery *does* know about is <, <=, etc. If you have
WHERE clauses like "x >= 33" and "x <= 54" then an index scan knows
to only scan the part of the index from 33 to 54. So you never even
visit a large fraction of the table. This is why an index scan can
be faster than a sequential scan even though the per-tuple overhead
of consulting the index is larger.

So, there is a special hack in the parser for LIKE (also for regexp
matches): if the parser sees that the match pattern has a fixed initial
substring, it inserts some >= and <= clauses that are designed to
exploit what the index scanner can do.

Our immediate problem is that we had to drop the <= clause in non-ASCII
locales because it was wrong. So now an index scan driven by LIKE
restrictions is not nearly as restrictive as it was, and has to visit
many tuples (about half the table on average) whereas before it was
likely to visit only a few, if you had a reasonably long fixed initial
string.

There are some other problems (notably, that the extra clauses are
inserted even if there's no index and thus no way that they will be
helpful) but those we know how to fix, and I hope to address them for
6.6. Fixing the <= problem requires knowledge about non-ASCII character
sets, and I for one don't know enough to fix it...

regards, tom lane

#11 Andreas Zeugswetter
andreas.zeugswetter@telecom.at
In reply to: Tom Lane (#10)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Perhaps the locale data can be used to gather this information?

It's certainly there ! locale data contains all information about
specific character set and encoding. Is it possible to use it to create
indices ? It should be slower but benefits of using indices will cover
expenses for non-US people. I didn't notice such behaviour in Informix
and Oracle.

Informix has national character handling, and it is indexable in
Informix, but it is not done for the standard types char and varchar: in
Informix you use nchar and nvarchar, and have one locale defined per database.
In Oracle you have national characters, but access is not indexable.
Actually, the SQL standard has something to say about national char
and varchar. I think it is wrong that char and varchar change their behavior
in PostgreSQL if you enable locale. According to the standard, a
locale-sensitive column needs to be specified as such in the CREATE TABLE
statement.
I never enable locale.

Andreas

#12 Oleg Bartunov
oleg@sai.msu.su
In reply to: Andreas Zeugswetter (#11)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Wed, 9 Jun 1999, ZEUGSWETTER Andreas IZ5 wrote:

Date: Wed, 9 Jun 1999 19:16:14 +0200
From: ZEUGSWETTER Andreas IZ5 <Andreas.Zeugswetter@telecom.at>
To: 'Oleg Bartunov' <oleg@sai.msu.su>
Cc: "'hackers@postgresql.org'" <hackers@postgresql.org>
Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Perhaps the locale data can be used to gather this information?

It's certainly there ! locale data contains all information about
specific character set and encoding. Is it possible to use it to create
indices ? It should be slower but benefits of using indices will cover
expenses for non-US people. I didn't notice such behaviour in Informix
and Oracle.

Informix has the national character handling, and it is indexable in
Informix.
But it is not done for the standard types char and varchar. In Informix
you use nchar and nvarchar, and have one locale defined per database.
In Oracle you have national characters, but access is not indexable.
Actually the SQL standard has something to say about national char
and varchar. I think it is wrong that char and varchar change their behavior
in postgresql, if you enable locale. A locale sensitive column needs to be
specified
as such in the create table statement according to the standard.

Thanks for the explanations. It would be great if I could specify for
specific columns whether they need locale. For now I have to pay a speed
penalty just to enable locale for a single column of one database !!!
I have always dreamed of being able to say 'SET LOCALE TO ON|OFF' on the fly.
It's a somewhat separate problem, but probably easier to implement.

Regards,
Oleg

I never enable locale.

Andreas


#13 Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Tom Lane wrote:

Bruce Momjian <maillist@candle.pha.pa.us> writes:

This certainly explains it. With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:
fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison. In ASCII locale this means
fld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.
The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence.

in single-byte locales it should be easy:

1. sort a char[256] array from 0-255 using the current locale settings,
do it once, either at startup or when first needed.

2. use binary search on that array to locate the last char before %
   in this sorted array:
   if (it is not the last char in the sorted array)
   then (replace that char with the one at index+1)
   else (
       if (it is not the first char in the like string)
       then (discard the last char and goto 2.)
       else (don't add the end restriction)
   )

some locales where the string is already sorted may use special
treatment (ASCII, CYRILLIC)

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp. So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.

for double-byte locales something similar should work, but getting
the initial array is probably tricky

----------------
Hannu

#14 Hannu Krosing
hannu@tm.ee
In reply to: Oleg Bartunov (#12)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Oleg Bartunov wrote:

Thanks for explanations. It would be great if I could specify for specific
columns if it needs locale. For now I have to pay decreasing in speed
just to enable locale for the only column of one database !!!
I always dream to be able to specify on fly 'SET LOCALE to ON|OFF'
It's a bit separate problem, but probably more easy to implement.

Actually, the locale should be part of the field definition (like length and
precision currently are), with a default of ASCII.

It would also probably be safest to maintain our own locale definitions (as
Oracle currently does), so that things don't break if somebody changes the
system ones.

-------------
Hannu

#15 Goran Thyni
goran@kirra.net
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

It was me who found the bug and supplied the ugly fix, but that was the
only quick fix we could find until someone figures out how to get the
correct one from locale.
USE_LOCALE uses strcoll instead of strcmp for comparisons and indexes,
and it sorts correctly, but it does not work with the MakeIndexable trick,
since \377 might not be the highest char in the alphabet or, worse, might
not be a char of the alphabet at all (as in the sv/fi locales).

The patch gives a slower correct result instead of a fast incorrect one,
which is better IMHO, but maybe I am old fashioned. :-)

Anyway, I think a correct solution should be:
"a LIKE 'abc%'" should generate:
"a >= 'abc\0' AND a < 'abd'"
where 'd' is the last char before % plus 1, or more correctly, the next
char in the alphabet after the last char before %.

Still looking for a locale guru. Hello!! :-(

regards,
--
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT
reboot!

#16 Bruce Momjian
bruce@momjian.us
In reply to: Daniel Kalchev (#5)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
most of my applications. Perhaps something can be done to resolve this
situation?

I understand the problem for 16-bit characters support, but for most of the
encodings that support only 8 bit characters it should be safe to assume the
maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things
here, because in my case the cyrillic letters order match that of the binary
encoding (that is, the first alphabet letter is before the second etc).

Perhaps the locale data can be used to gather this information?

I will do some testing without using locale to see what happens.

The locale check is near the bottom of parser/gram.y, so you can
certainly enable indexing there.

I am told that French does not have 255 as its maximum character, and
there is no collating interface to request the highest character.  I
suppose one hack would be to go out and test all the char values to see
which is highest.
#17 Bruce Momjian
bruce@momjian.us
In reply to: Oleg Bartunov (#7)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

It's certainly there ! locale data contains all information about
specific character set and encoding. Is it possible to use it to create
indices ? It should be slower but benefits of using indices will cover
expenses for non-US people. I didn't notice such behaviour in Informix
and Oracle. Fixing this would be a good point in respect of popularity
of Postgres. Are there any chance to place it in TODO for 6.6 ?
At least explain should reflect such fact !

Added to TODO:

* Allow indexing of LIKE with locale character sets

#18 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Yes. This pretty much sums up the problem.

Bruce Momjian <maillist@candle.pha.pa.us> writes:

This certainly explains it. With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:
fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison. In ASCII locale this means
fld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.

The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence. I had
been thinking about a brute force method: generate a string and
check to see whether strcmp claims it's greater than the original
or not; if not, increment the last byte and try again. You'd
also have to be able to back up and increment earlier bytes if
you maxed out the last one. For example, in French locale,
fld LIKE 'ab\376%'
you'd first produce 'ab\377' but discover that it's less than
'ab\376' (because \377 is y-dieresis which sorts like 'y').
Your next try must be 'ac\377' which will succeed.

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp. So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.

regards, tom lane

#19 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#10)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

LIKE doesn't know anything about indexes, nor vice versa. What the
index-scan machinery *does* know about is <, <=, etc. If you have
WHERE clauses like "x >= 33" and "x <= 54" then an index scan knows
to only scan the part of the index from 33 to 54. So you never even
visit a large fraction of the table. This is why an index scan can
be faster than a sequential scan even though the per-tuple overhead
of consulting the index is larger.

You know, everyone beats me up for that LIKE indexing hack, but every
month that goes by where someone does not come up with a better solution
makes me feel a little better about the criticism.

#20 Bruce Momjian
bruce@momjian.us
In reply to: Hannu Krosing (#14)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Oleg Bartunov wrote:

Thanks for explanations. It would be great if I could specify for specific
columns if it needs locale. For now I have to pay decreasing in speed
just to enable locale for the only column of one database !!!
I always dream to be able to specify on fly 'SET LOCALE to ON|OFF'
It's a bit separate problem, but probably more easy to implement.

Actually the locale should be part of field definition (like length and
precision currently are) with default of ascii.

And also it may be probably safest to maintain our own locale defs (as
Oracle
currently does)so that things don't break if somebody changes the system
ones.

Added to TODO:

* Allow LOCALE on a per-column basis, default to ASCII

#21 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Andreas Zeugswetter (#11)
#22 Michael Robinson
robinson@netrinsics.com
In reply to: Thomas Lockhart (#21)
#23 Daniel Kalchev
daniel@digsys.bg
In reply to: Bruce Momjian (#16)
#24 Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Thomas Lockhart (#21)
#25 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Oleg Broytmann (#24)
#26 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#24)
#27 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#26)
#28 Goran Thyni
goran@kirra.net
In reply to: Tatsuo Ishii (#25)
#29 Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Thomas Lockhart (#26)
#30 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Goran Thyni (#28)
#31 Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Tatsuo Ishii (#25)
#32 Dmitry Samersoff
dms@wplus.net
In reply to: Goran Thyni (#28)
#33 Dmitry Samersoff
dms@wplus.net
In reply to: Oleg Broytmann (#29)
#34 Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Dmitry Samersoff (#33)
#35 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#29)
#36 Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Thomas Lockhart (#35)
#37 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#36)
#38 Hannu Krosing
hannu@tm.ee
In reply to: Oleg Broytmann (#36)
#39 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#36)
#40 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#39)
#41 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#40)
#42 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tatsuo Ishii (#40)
#43 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#42)
#44 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tatsuo Ishii (#43)