Postgres 6.5 beta2 and beta3 problem

Started by Daniel Kalchev, over 26 years ago. 44 messages.
#1Daniel Kalchev
daniel@digsys.bg

Hello,

Sorry that I picked this up too late in the release cycle, but other work
prevented my earlier involvement in the testing.

There are at least two serious problems that I discovered so far with Postgres
6.5 beta2 and beta3 (running on BSD/OS 4.0.1):

1. LIKE with indexes works worse than without indexes.

Given the following schema:

CREATE TABLE "words" (
"w_key" text,
"w_pages" text);
CREATE INDEX "w_k_words_i" on "words" using btree ( "w_key" "text_ops" );

The table words has 117743 records.

the following query:

select w_key from words where w_key like 'sometext%'

is explained as:

Index Scan using w_k_words_i on words (cost=3335.38 rows=1 width=12)

and runs for several minutes. If I drop the w_k_words_i index, the explain is:

Seq Scan on words (cost=7609.52 rows=1 width=12)

and the query runs noticeably faster.

Under 6.4 the behavior is as expected, much better with indexes.

2. Under Postgres 6.4 the following query:

SELECT config.serviceid, custid, datetime_date( updated_at ) as date, archived
as a, c.subserviceid as ss, c.usage_price as
price, c.usage_included as time, service
FROM a, b, c
WHERE confid in ( SELECT confid
FROM a
WHERE archived_at > '30-04-1999'
AND created_at < '30-04-1999' )
AND not archived
AND a.serviceid=b.serviceid
AND c.serviceid=a.serviceid
GROUP BY custid, serviceid, subserviceid;

works, although it runs for an indefinitely long time (due to the subselect -
but this is not a problem, as it can be rewritten). Under Postgres 6.5
however, it is not accepted, because there are no aggregates in the target
list. Is this incorrect behavior of the 6.4.2 version, or does 6.5 have
different syntax?

Regards,
Daniel Kalchev

#2Daniel Kalchev
daniel@digsys.bg
In reply to: Daniel Kalchev (#1)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

Regards,
Daniel

"Hiroshi Inoue" said:

-----Original Message-----
From: owner-pgsql-hackers@postgreSQL.org
[mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Daniel Kalchev
Sent: Wednesday, June 09, 1999 5:04 PM
To: pgsql-hackers@hub.org
Subject: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Hello,

Sorry that I picked this up too late in the release cycle, but other work
prevented my earlier involvement in the testing.

There are at least two serious problems that I discovered so far
with Postgres
6.5 beta2 and beta3 (running on BSD/OS 4.0.1):

1. LIKE with indexes works worse than without indexes.

Did you build with --enable-locale?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#3Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Daniel Kalchev (#2)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use
indexes, because we can't do the indexing trick with non-ASCII character
sets: we can't figure out the maximum character value for a particular
encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4Oleg Bartunov
oleg@sai.msu.su
In reply to: Bruce Momjian (#3)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Wed, 9 Jun 1999, Bruce Momjian wrote:

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use
indexes, because we can't do the indexing trick with non-ASCII character
sets: we can't figure out the maximum character value for a particular
encoding.

If so, why does explain report 'Index Scan ....'?
apod=> \d
Database    = apod
 +------------------+----------------------------------+----------+
 |  Owner           |             Relation             |   Type   |
 +------------------+----------------------------------+----------+
 | megera           | idx_adate                        | index    |
 | megera           | idx_atitle                       | index    |
 | megera           | idx_url                          | index    |
 | megera           | titles                           | table    |
 +------------------+----------------------------------+----------+

apod=> explain select atitle from titles where atitle like 'Sun%';
NOTICE: QUERY PLAN:

Index Scan using idx_atitle on titles (cost=33.28 rows=1 width=12)

Current cvs, --enable-locale --with-mb=KOI8

Regards,

Oleg

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#5Daniel Kalchev
daniel@digsys.bg
In reply to: Bruce Momjian (#3)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
most of my applications. Perhaps something can be done to resolve this
situation?

I understand the problem for 16-bit characters support, but for most of the
encodings that support only 8 bit characters it should be safe to assume the
maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things
here, because in my case the Cyrillic letter order matches that of the
binary encoding (that is, the first alphabet letter is before the second,
etc).

Perhaps the locale data can be used to gather this information?

I will do some testing without using locale to see what happens.

Regards,
Daniel

Bruce Momjian said:

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use
indexes, because we can't do the indexing trick with non-ASCII character
sets: we can't figure out the maximum character value for a particular
encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.

#6Daniel Kalchev
daniel@digsys.bg
In reply to: Daniel Kalchev (#5)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Further to my Cyrillic charset (--enable-locale) woes, I would like to add
that when recompiling Postgres 6.5 without --enable-locale, the query now
runs reasonably fast and explain gives:

Index Scan using w_k_words_i on words (cost=1112.13 rows=1 width=12)

instead of

Index Scan using w_k_words_i on words (cost=3335.38 rows=1 width=12)

The problem, it seems, is that the restrictions on using indexes when locale
is enabled are not consistently applied - explain shows that indices will be
used, and the behavior with indexes and without indexes differs (with
indexes it's noticeably slower :-) so indexes are apparently being used...

Apparently (for my current tests at least) the Bulgarian Cyrillic
(windows-1251) is handled reasonably well without locale support - until now
~* didn't work anyway.

Daniel Kalchev

#7Oleg Bartunov
oleg@sai.msu.su
In reply to: Daniel Kalchev (#5)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Wed, 9 Jun 1999, Daniel Kalchev wrote:

Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
most of my applications. Perhaps something can be done to resolve this
situation?

I understand the problem for 16-bit characters support, but for most of the
encodings that support only 8 bit characters it should be safe to assume the
maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things
here, because in my case the Cyrillic letter order matches that of the
binary encoding (that is, the first alphabet letter is before the second,
etc).

Perhaps the locale data can be used to gather this information?

It's certainly there! Locale data contains all the information about a
specific character set and encoding. Is it possible to use it to create
indices? It would be slower, but the benefits of using indices would cover
the expense for non-US people. I didn't notice such behavior in Informix
and Oracle. Fixing this would be a good point for the popularity
of Postgres. Is there any chance to place it in the TODO for 6.6?
At least explain should reflect this fact!

Regards,

Oleg

I will do some testing without using locale to see what happens.

Regards,
Daniel

Bruce Momjian said:

Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use
indexes, because we can't do the indexing trick with non-ASCII character
sets: we can't figure out the maximum character value for a particular
encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oleg Bartunov (#7)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Daniel Kalchev <daniel@digsys.bg> writes:

1. LIKE with indexes works worse than without indexes.

Since you are using USE_LOCALE, the parser is inserting only a one-sided
index restriction; that is

    WHERE w_key like 'sometext%'

becomes

    WHERE w_key like 'sometext%' AND w_key >= 'sometext'

whereas without USE_LOCALE it becomes

    WHERE w_key like 'sometext%' AND w_key >= 'sometext'
      AND w_key <= 'sometext\377'
6.4 always did the latter, which was wrong in non-ASCII locales because
\377 might not be the highest character in the sort order. (Strictly
speaking it's wrong in ASCII locale as well...)

Of course, the one-sided index restriction is much less selective than
the two-sided; depending on what 'sometext' actually is, you might end
up scanning most of the table, and since index scan is much slower
per-tuple-scanned than sequential scan, you lose. That's evidently
what's happening here.

I suspect that the optimizer's cost estimates need refinement;
it should be able to guess that the sequential scan will be the
faster choice here.

Of course what you really want is a two-sided index restriction,
but we are not going to be able to fix that until someone figures
out a locale-independent way of generating a "slightly larger"
comparison string. So far I have not heard any proposals that
sound like they will work...

Under Postgres 6.5 however, it
is not accepted, because there are no aggregates in the target list.

No, that's not what it's unhappy about; it's unhappy because there
are ungrouped fields used in the target list. This is erroneous
SQL because there's no unique choice of value to return for such a
field (if several tuples are grouped together, which one's value
of the field do you use?) Prior versions of Postgres failed to detect
this error, but it's an error. You were getting randomly selected
values for the ungrouped fields, I suppose.

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce Momjian <maillist@candle.pha.pa.us> writes:

This certainly explains it. With locale enabled, LIKE does not use
indexes, because we can't do the indexing trick with non-ASCII character
sets: we can't figure out the maximum character value for a particular
encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:
fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison. In ASCII locale this means
fld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.

The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence. I had
been thinking about a brute force method: generate a string and
check to see whether strcmp claims it's greater than the original
or not; if not, increment the last byte and try again. You'd
also have to be able to back up and increment earlier bytes if
you maxed out the last one. For example, in French locale,
fld LIKE 'ab\376%'
you'd first produce 'ab\377' but discover that it's less than
'ab\376' (because \377 is y-dieresis which sorts like 'y').
Your next try must be 'ac\377' which will succeed.
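
For concreteness, a minimal C sketch of this brute-force search (an
illustration only: it assumes single-byte encodings and that strcoll()
honors the current LC_COLLATE setting; make_greater_string is a made-up
name, not actual backend code):

#include <stdio.h>
#include <string.h>
#include <locale.h>

/*
 * Brute-force search for a string that the current locale collates
 * strictly greater than the fixed LIKE prefix "pfx": bump the last
 * byte and ask strcoll(); if a byte maxes out at \377 with no luck,
 * truncate it and try incrementing the previous byte instead.
 */
static int
make_greater_string(const char *pfx, char *out, size_t outsize)
{
    size_t len = strlen(pfx);

    if (len + 1 > outsize)
        return 0;
    strcpy(out, pfx);

    while (len > 0)
    {
        unsigned char *p = (unsigned char *) &out[len - 1];

        while (*p < 0xFF)
        {
            (*p)++;
            if (strcoll(out, pfx) > 0)
                return 1;       /* found a strictly greater string */
        }
        out[--len] = '\0';      /* maxed out; back up one byte */
    }
    return 0;                   /* prefix already collates highest */
}

int
main(void)
{
    char buf[16];

    setlocale(LC_COLLATE, "");  /* use the environment's locale */
    if (make_greater_string("abc", buf, sizeof buf))
        printf("fld LIKE 'abc%%'  ==>  fld < '%s'\n", buf);
    return 0;
}

In the plain C locale this prints fld < 'abd'; for the French 'ab\376'
example above, the truncation branch would settle on 'ac' instead.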

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp. So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.

regards, tom lane

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Daniel Kalchev <daniel@digsys.bg> writes:

The problem as it seems is that the restrictions to use indexes when
locale is enabled are not consistently applied - explain shows that
indices will be used, and the behavior with indexes and without
indexes is different (with indexes it's noticeably slower :-) so
indexes are apparently being used...

Right, but what EXPLAIN doesn't show you (unless you can read the
much uglier EXPLAIN VERBOSE output) is what index restrictions are
being used.

LIKE doesn't know anything about indexes, nor vice versa. What the
index-scan machinery *does* know about is <, <=, etc. If you have
WHERE clauses like "x >= 33" and "x <= 54" then an index scan knows
to only scan the part of the index from 33 to 54. So you never even
visit a large fraction of the table. This is why an index scan can
be faster than a sequential scan even though the per-tuple overhead
of consulting the index is larger.

So, there is a special hack in the parser for LIKE (also for regexp
matches): if the parser sees that the match pattern has a fixed initial
substring, it inserts some >= and <= clauses that are designed to
exploit what the index scanner can do.

Our immediate problem is that we had to drop the <= clause in non-ASCII
locales because it was wrong. So now an index scan driven by LIKE
restrictions is not nearly as restrictive as it was, and has to visit
many tuples (about half the table on average) whereas before it was
likely to visit only a few, if you had a reasonably long fixed initial
string.

There are some other problems (notably, that the extra clauses are
inserted even if there's no index and thus no way that they will be
helpful) but those we know how to fix, and I hope to address them for
6.6. Fixing the <= problem requires knowledge about non-ASCII character
sets, and I for one don't know enough to fix it...

regards, tom lane

#11ZEUGSWETTER Andreas IZ5
Andreas.Zeugswetter@telecom.at
In reply to: Tom Lane (#10)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Perhaps the locale data can be used to gather this information?

It's certainly there! Locale data contains all the information about a
specific character set and encoding. Is it possible to use it to create
indices? It would be slower, but the benefits of using indices would cover
the expense for non-US people. I didn't notice such behavior in Informix
and Oracle.

Informix has national character handling, and it is indexable in
Informix. But it is not done for the standard types char and varchar; in
Informix you use nchar and nvarchar, and have one locale defined per database.
In Oracle you have national characters, but access is not indexable.
Actually the SQL standard has something to say about national char
and varchar. I think it is wrong that char and varchar change their behavior
in postgresql if you enable locale. According to the standard, a
locale-sensitive column needs to be specified as such in the create table
statement. I never enable locale.

Andreas

#12Oleg Bartunov
oleg@sai.msu.su
In reply to: ZEUGSWETTER Andreas IZ5 (#11)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Wed, 9 Jun 1999, ZEUGSWETTER Andreas IZ5 wrote:

Date: Wed, 9 Jun 1999 19:16:14 +0200
From: ZEUGSWETTER Andreas IZ5 <Andreas.Zeugswetter@telecom.at>
To: 'Oleg Bartunov' <oleg@sai.msu.su>
Cc: "'hackers@postgresql.org'" <hackers@postgresql.org>
Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Perhaps the locale data can be used to gather this information?

It's certainly there! Locale data contains all the information about a
specific character set and encoding. Is it possible to use it to create
indices? It would be slower, but the benefits of using indices would cover
the expense for non-US people. I didn't notice such behavior in Informix
and Oracle.

Informix has national character handling, and it is indexable in
Informix. But it is not done for the standard types char and varchar; in
Informix you use nchar and nvarchar, and have one locale defined per database.
In Oracle you have national characters, but access is not indexable.
Actually the SQL standard has something to say about national char
and varchar. I think it is wrong that char and varchar change their behavior
in postgresql if you enable locale. According to the standard, a
locale-sensitive column needs to be specified as such in the create table
statement.

Thanks for the explanations. It would be great if I could specify for specific
columns whether they need locale. For now I have to pay a decrease in speed
just to enable locale for the only column of one database!!!
I have always dreamed of being able to specify 'SET LOCALE TO ON|OFF' on the fly.
It's a somewhat separate problem, but probably easier to implement.

Regards,
Oleg

I never enable locale.

Andreas

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#13Hannu Krosing
hannu@trust.ee
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Tom Lane wrote:

Bruce Momjian <maillist@candle.pha.pa.us> writes:

This certainly explains it. With locale enabled, LIKE does not use
indexes, because we can't do the indexing trick with non-ASCII character
sets: we can't figure out the maximum character value for a particular
encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:
fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison. In ASCII locale this means
fld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.
The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence.

in single-byte locales it should be easy:

1. sort a char[256] array from 0-255 using the current locale settings,
do it once, either at startup or when first needed.

2. use binary search on that array to locate the last char before %
   in this sorted array:

   if (it is not the last char in the sorted array)
   then (replace that char with the one at index+1)
   else (
       if (it is not the first char in the like string)
       then (discard the last char and goto 2.)
       else (don't do the end restriction)
   )

some locales where the string is already sorted may use special
treatment (ASCII, CYRILLIC)
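
A minimal C sketch of steps 1 and 2 (illustration only: single-byte
locales, setlocale(LC_COLLATE, "") assumed to have been called, helper
names made up):

#include <stdlib.h>
#include <string.h>

/* Byte values 1..255 ordered by the current LC_COLLATE (built once). */
static unsigned char sorted_chars[255];

static int
coll_cmp(const void *a, const void *b)
{
    char s1[2];
    char s2[2];

    s1[0] = *(const char *) a;
    s1[1] = '\0';
    s2[0] = *(const char *) b;
    s2[1] = '\0';
    return strcoll(s1, s2);
}

/* Step 1: sort all byte values, at startup or when first needed. */
static void
build_sorted_chars(void)
{
    int i;

    for (i = 1; i <= 255; i++)
        sorted_chars[i - 1] = (unsigned char) i;
    qsort(sorted_chars, sizeof(sorted_chars), 1, coll_cmp);
}

/*
 * Step 2 helper: the character following c in collation order, or 0
 * if c sorts last (the caller then shortens the prefix and retries).
 * A linear scan keeps the sketch short; the binary search mentioned
 * above needs a byte-to-index reverse map.
 */
static unsigned char
next_char_in_collation(unsigned char c)
{
    int i;

    for (i = 0; i < 255; i++)
        if (sorted_chars[i] == c)
            return (i + 1 < 255) ? sorted_chars[i + 1] : 0;
    return 0;
}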

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp. So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.

for double-byte locales something similar should work, but getting
the initial array is probably tricky

----------------
Hannu

#14Hannu Krosing
hannu@trust.ee
In reply to: Oleg Bartunov (#12)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Oleg Bartunov wrote:

Thanks for the explanations. It would be great if I could specify for specific
columns whether they need locale. For now I have to pay a decrease in speed
just to enable locale for the only column of one database!!!
I have always dreamed of being able to specify 'SET LOCALE TO ON|OFF' on the fly.
It's a somewhat separate problem, but probably easier to implement.

Actually the locale should be part of the field definition (like length and
precision currently are), with a default of ascii.

And it may also be safest to maintain our own locale defs (as Oracle
currently does) so that things don't break if somebody changes the system
ones.

-------------
Hannu

#15Goran Thyni
goran@kirra.net
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

It was me who found the bug and supplied the ugly fix, but that was the
only quick fix we could find until someone figures out how to get the
correct one from locale.
USE_LOCALE uses strcoll instead of strcmp for comparisons and indexes,
and it sorts correctly, but it does not work with the MakeIndexable trick,
since \377 might not be the highest char in the alphabet or, worse, might
not be a char of the alphabet at all (as in the sv/fi locales).

The patch gives a slower correct result instead of a fast incorrect one,
which is better IMHO, but maybe I am old-fashioned. :-)

Anyway I think a correct solution should be:
"a LIKE 'abc%'" should generate:
"a >= 'abc\0' AND a < 'abd'"
where 'd' is the last char before the % plus 1, or more correctly, the
next char in the alphabet after the last char before the %.

Still looking for a locale guru. Hello!! :-(

regards,
--
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT
reboot!

#16Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Daniel Kalchev (#5)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
most of my applications. Perhaps something can be done to resolve this
situation?

I understand the problem for 16-bit characters support, but for most of the
encodings that support only 8 bit characters it should be safe to assume the
maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things
here, because in my case the Cyrillic letter order matches that of the
binary encoding (that is, the first alphabet letter is before the second,
etc).

Perhaps the locale data can be used to gather this information?

I will do some testing without using locale to see what happens.

The locale check is near the bottom of parser/gram.y, so you can
certainly enable indexing there.

I am told that French does not have 255 as its max character, and
there is no collating interface to request the highest character.  I
suppose one hack would be to go out and test all the char values to see
which is highest.
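
That hack is only a few lines of C; a sketch (single-byte locales,
hypothetical function name):

#include <string.h>

/*
 * Scan all single-byte values and keep the one that the current
 * LC_COLLATE setting sorts highest.
 */
static unsigned char
highest_collating_char(void)
{
    char s1[2] = {0, 0};
    char s2[2] = {0, 0};
    unsigned char maxc = 1;
    int c;

    for (c = 2; c <= 255; c++)
    {
        s1[0] = (char) c;
        s2[0] = (char) maxc;
        if (strcoll(s1, s2) > 0)
            maxc = (unsigned char) c;
    }
    return maxc;
}
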
-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#17Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Oleg Bartunov (#7)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

It's certainly there! Locale data contains all the information about a
specific character set and encoding. Is it possible to use it to create
indices? It would be slower, but the benefits of using indices would cover
the expense for non-US people. I didn't notice such behavior in Informix
and Oracle. Fixing this would be a good point for the popularity
of Postgres. Is there any chance to place it in the TODO for 6.6?
At least explain should reflect this fact!

Added to TODO:

* Allow indexing of LIKE with locale character sets

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#18Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#9)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Yes. This pretty much sums up the problem.

Bruce Momjian <maillist@candle.pha.pa.us> writes:

This certainly explains it. With locale enabled, LIKE does not use
indexes, because we can't do the indexing trick with non-ASCII character
sets: we can't figure out the maximum character value for a particular
encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:
fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison. In ASCII locale this means
fld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.

The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence. I had
been thinking about a brute force method: generate a string and
check to see whether strcmp claims it's greater than the original
or not; if not, increment the last byte and try again. You'd
also have to be able to back up and increment earlier bytes if
you maxed out the last one. For example, in French locale,
fld LIKE 'ab\376%'
you'd first produce 'ab\377' but discover that it's less than
'ab\376' (because \377 is y-dieresis which sorts like 'y').
Your next try must be 'ac\377' which will succeed.

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp. So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.

regards, tom lane

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#19Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#10)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

LIKE doesn't know anything about indexes, nor vice versa. What the
index-scan machinery *does* know about is <, <=, etc. If you have
WHERE clauses like "x >= 33" and "x <= 54" then an index scan knows
to only scan the part of the index from 33 to 54. So you never even
visit a large fraction of the table. This is why an index scan can
be faster than a sequential scan even though the per-tuple overhead
of consulting the index is larger.

You know, everyone beats me up for that LIKE indexing hack, but every
month that goes by where someone does not come up with a better solution
makes me feel a little better about the criticism.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#20Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Hannu Krosing (#14)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Oleg Bartunov wrote:

Thanks for the explanations. It would be great if I could specify for specific
columns whether they need locale. For now I have to pay a decrease in speed
just to enable locale for the only column of one database!!!
I have always dreamed of being able to specify 'SET LOCALE TO ON|OFF' on the fly.
It's a somewhat separate problem, but probably easier to implement.

Actually the locale should be part of the field definition (like length and
precision currently are), with a default of ascii.

And it may also be safest to maintain our own locale defs (as Oracle
currently does) so that things don't break if somebody changes the system
ones.

Added to TODO:

* Allow LOCALE on a per-column basis, default to ASCII

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#21Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: ZEUGSWETTER Andreas IZ5 (#11)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Actually the SQL standard has something to say about national char
and varchar. I think it is wrong that char and varchar change their
behavior in postgresql if you enable locale. According to the standard, a
locale-sensitive column needs to be specified as such in the create table
statement. I never enable locale.

I had some discussions on the list a while ago (6 months?) about this
topic, but never really got feedback from "locale-using" people that
NATIONAL CHARACTER and collation sequences are an acceptable solution.
istm that Postgres' extensibility would make this *very* easy to
implement and extend, and that then everyone would get the same
behavior from CHAR while being able to get the behaviors they need
from a variety of other character sets.

I do have an interest in implementing or helping with something, but
since I don't have to live with the consequences of the results
(coming from an ASCII country :) it seemed to be poor form to push it
without feedback from others...

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#22Michael Robinson
robinson@netrinsics.com
In reply to: Thomas Lockhart (#21)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Hannu Krosing <hannu@trust.ee> writes:

in single-byte locales it should be easy:

If you do it right, in double-byte locales as well.

1. sort a char[256] array from 0-255 using the current locale settings,
do it once, either at startup or when first needed.

Or, alternatively, maintain per-locale table files and mmap them.

2. use binary search on that array to locate the last char before %
in this sorted array:

Or, alternatively, maintain another table that maps char values to
lexicographic order (and a per-locale constant for maximum character
order value):

if (it is not the last char in the sorted array)
then (replace that char with the one at index+1)
else (
    if (it is not the first char in the like string)
    then (discard the last char and goto 2.)
    else (don't do the end restriction)
)

if ((nextindex = charorder[c] + 1) <= maxchar) {
    nextchar = charmap[nextindex];
} else {
    /* no nextchar: append charmap[maxchar] to the LIKE base string */
}

I don't see where the pain is, but I may be missing something.
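
(For completeness, the two tables above are just the sorted array from
the earlier sketch plus its inverse - hypothetical names again, with
maxchar == 254 for 255 byte values:)

/*
 * charmap[i]  : i-th byte value in collation order.
 * charorder[c]: position of byte value c in that order (inverse map),
 *               giving O(1) lookup instead of a search.
 */
static unsigned char charmap[256];
static int charorder[256];

static void
build_order_tables(void)
{
    int i;

    build_sorted_chars();       /* from the earlier sketch */
    for (i = 0; i < 255; i++)
    {
        charmap[i] = sorted_chars[i];
        charorder[sorted_chars[i]] = i;
    }
}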

-Michael Robinson

#23Daniel Kalchev
daniel@digsys.bg
In reply to: Bruce Momjian (#16)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Bruce Momjian said:

The locale check is near the bottom of parser/gram.y, so you can
certainly enable indexing there.

I commented out the two ifdefs that had USE_LOCALE in gram.y, and now LIKE
uses indexes for me...

I am told that French does not have 255 as its max character, and
there is no collating interface to request the highest character. I
suppose one hack would be to go out and test all the char values to see
which is highest.

As I understand the current 'non-locale' way the LIKE indexing works, it
should create this problem for the Bulgarian Cyrillic (cp1251) too, because
character code 255 is a valid Cyrillic character (the last one in the
alphabet). Therefore, the solution proposed by Hannu Krosing should be
implemented.

I believe we can make the assumption that if you --enable-locale but do not
use --with-mb, then you are using single-byte locales, and therefore the hack
might work properly. If you use --with-mb you are out of luck until someone
explains better how multibyte characters are ordered.

Is there other place than gram.y where this hack needs to be implemented?

While I am bashing the locale support... why does the ~* operator not work
with locales? That is, I need to use a construct like

SELECT key from t WHERE upper(a) ~ upper('somestring');

instead of

SELECT key FROM t WHERE a ~* 'somestring';

Or store everything in the database in uppercase (works for keys) and upper
the string in the frontend. The result is that this construct does not use
indices. We also cannot create indexes on upper(a). I believe this was an
outstanding problem in pre-6.4.

I found this latter problem to be resolved by modifying the
backend/regex/makefile to add -funsigned-char to CFLAGS. This is under BSD/OS
4.0.1 - I found out that, by default, characters that are 'alpha' in Cyrillic
are treated by the compiler as negative, and therefore isalpha() returns
zero... I believe this should be fixed, as it may be causing other problems
with non-ASCII locales.

My proposal is to add -funsigned-char to all places where routines such as
'isalpha' are used, and ifdef it for USE_LOCALE.
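
A minimal C illustration of the pitfall (my own sketch; it assumes a
platform where plain char is signed by default, as with gcc on BSD/OS):

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
    char c = '\xE0';            /* a CP1251 Cyrillic letter */

    setlocale(LC_CTYPE, "");    /* e.g. LC_CTYPE=CP1251 in the environment */

    /*
     * Where plain char is signed, c is negative here, and passing a
     * negative value other than EOF to isalpha() is undefined behavior;
     * in practice it indexes in front of the classification table and
     * typically yields 0.  Building with -funsigned-char, or casting to
     * unsigned char first, gives the intended nonzero answer.
     */
    printf("isalpha(c)                 = %d\n", isalpha(c));
    printf("isalpha((unsigned char) c) = %d\n", isalpha((unsigned char) c));
    return 0;
}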

Daniel

#24Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Thomas Lockhart (#21)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Thu, 10 Jun 1999, Thomas Lockhart wrote:

I had some discussions on the list a while ago (6 months?) about this
topic, but never really got feedback from "locale-using" people that
NATIONAL CHARACTER and collation sequences are an acceptable solution.

What feedback do you want? I am pretty sure two Olegs on this list are
ready to continue the discussion.

istm that Postgres' extensibility would make this *very* easy to
implement and extend, and that then everyone would get the same
behavior from CHAR while being able to get the behaviors they need
from a variety of other character sets.

I do have an interest in implementing or helping with something, but
since I don't have to live with the consequences of the results
(coming from an ASCII country :) it seemed to be poor form to push it
without feedback from others...

- Thomas

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#25Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Oleg Broytmann (#24)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

I believe we can make the assumption that if you --enable-locale but do not
use --with-mb, then you are using single-byte locales, and therefore the hack
might work properly.

I believe at least one person uses both --enable-locale and --with-mb
for a single-byte locale.

If you use --with-mb you are out of luck until someone
explains better how multibyte characters are ordered.

I think that until NATIONAL CHARACTER is fully implemented, we will not be
able to properly handle sort orders with multi-byte characters.
(Thomas, I've been thinking about implementing NCHAR too!)

So you could do whatever you like as long as it is inside #ifndef
MULTIBYTE :-)

I found this latter problem to be resolved by modifying the
backend/regex/makefile to add -funsigned-char to CFLAGS. This is under BSD/OS
4.0.1 - I found out that, by default, characters that are 'alpha' in Cyrillic
are treated by the compiler as negative, and therefore isalpha() returns
zero... I believe this should be fixed, as it may be causing other problems
with non-ASCII locales.

My proposal is to add -funsigned-char to all places where routines such as
'isalpha' are used, and ifdef it for USE_LOCALE.

Once I proposed another solution to a BSD/OS guy (I don't remember who he
was) and asked him to check if it worked. Unfortunately I got no
answer from him. Then I merged it into the #ifdef MULTIBYTE sections in
regcomp.c, and it seems to work at least for the Cyrillic locale (thanks Oleg
for testing!).
---
Tatsuo Ishii

#26Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#24)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

What feedback do you want? I am pretty sure two Olegs on this list are
ready to continue the discussion.

istm that the Russian and Japanese contingents could represent the
needs of multibyte and locale concerns very well. So, we should ask
ourselves some questions to see if we can make *progress* in evolving
our text handling, rather than just staying the same forever.

SQL92 suggests some specific text handling features to help with
non-ascii applications. "national character" is, afaik, the feature
which would hold an installation-wide local text type. "collations"
would allow other text types in the same installation, but SQL92 is a
bit fuzzier about how to make them work.

Would these mechanisms work for people? Or are they so fundamentally
flawed or non-standard (it is from a standard, but I'm not sure who
implements it)?

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#27Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#26)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

istm that the Russian and Japanese contingents could represent the
needs of multibyte and locale concerns very well. So, we should ask
ourselves some questions to see if we can make *progress* in evolving
our text handling, rather than just staying the same forever.

SQL92 suggests some specific text handling features to help with
non-ascii applications. "national character" is, afaik, the feature
which would hold an installation-wide local text type. "collations"
would allow other text types in the same installation, but SQL92 is a
bit fuzzier about how to make them work.

Would these mechanisms work for people? Or are they so fundamentally
flawed or non-standard (it is from a standard, but I'm not sure who
implements it)?

In my opinion, introducing NCHAR is useful at least for single byte
codes. Although I'm not familiar with single byte languages, I see
strong demand for NCHAR in recent discussions.

I don't mean it's useless for multibyte, however. As far as I know, the
multibyte locales provided by any OS are totally broken, especially
COLLATE (I seriously doubt the existing locale framework works for
multibyte). It would be nice to have our own locale data for COLLATE
somewhere in our system, like some commercial DBMSs do, or even better,
to allow user-defined collations (this is already in the standard).

I have a feeling that I'm going to implement CREATE CHARSET by
slightly modifying the multibyte support code, which currently only
allows predefined charsets. That will be the first step toward CREATE
COLLATE, NCHAR, etc....
---
Tatsuo Ishii

#28Goran Thyni
goran@kirra.net
In reply to: Tatsuo Ishii (#25)
locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

Tatsuo Ishii wrote:

I think that until NATIONAL CHARACTER is fully implemented, we will not be
able to properly handle sort orders with multi-byte characters.
(Thomas, I've been thinking about implementing NCHAR too!)

So you could do whatever you like as long as it is inside #ifndef
MULTIBYTE :-)

Hmm,
and the world is moving towards Unicode.
We definitely need working support for locales and multibyte chars.
Perhaps the right way to do it is to store everything in
some Unicode format internally and convert the output
according to the suggested "per-column locale definition".

Larry Wall has hacked UTF8 support into Perl, so it is doable;
let's see what we can do for 6.6 (time permitting, as always).

regards,
Göran

#29Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Thomas Lockhart (#26)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On Thu, 10 Jun 1999, Thomas Lockhart wrote:

istm that the Russian and Japanese contingents could represent the
needs of multibyte and locale concerns very well. So, we should ask
ourselves some questions to see if we can make *progress* in evolving
our text handling, rather than just staying the same forever.

Ok, we are here.
And what are the pros and cons of NCHAR?

SQL92 suggests some specific text handling features to help with
non-ascii applications. "national character" is, afaik, the feature

What is the help?

which would hold an installation-wide local text type. "collations"
would allow other text types in the same installation, but SQL92 is a
bit fuzzier about how to make them work.

Would these mechanisms work for people? Or are they so fundamentally
flawed or non-standard (it is from a standard, but I'm not sure who
implements it)?

- Thomas

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#30Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Goran Thyni (#28)
Re: locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

and the world is moving towards Unicode.
we definitely need working support for locales and multichar.
Perhaps the right way to do it is to store everything in
som Unicode format internally and convert the output
according to the suggested "per-column-locale-definition".

No. There's nothing perfect in the world; Unicode is no
exception. So we need to keep the freedom to choose among
internal encodings. Currently the mb support allows several internal
encodings including Unicode and mule-internal-code
(yes, you can do regexp/LIKE on Unicode data if mb support is
enabled).
--
Tatsuo Ishii

#31Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Tatsuo Ishii (#25)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Hi!

On Thu, 10 Jun 1999, Tatsuo Ishii wrote:

I believe we can make the assumption that if you --enable-locale but do not
use --with-mb, then you are using single-byte locales, and therefore the hack
might work properly.

I believe at least one person uses both --enable-locale and --with-mb
for a single-byte locale.

I think it's me. But very soon everyone in Russia will use locale+mb for the
koi8 and win1251 locales - all of us constantly need to convert between them.
Making Postgres convert on the fly is a big win.

Once I proposed another solution to a BSD/OS guy (I don't remember who he
was) and asked him to check if it worked. Unfortunately I got no
answer from him. Then I merged it into the #ifdef MULTIBYTE sections in

I remember he raised the issue a few times, and finally we agreed on
-funsigned-char. I think he got a working Postgres and locale after all.

regcomp.c, and it seems to work at least for the Cyrillic locale (thanks Oleg
for testing!).

Works pretty well now!

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#32Dmitry Samersoff
dms@wplus.net
In reply to: Goran Thyni (#28)
RE: [HACKERS] locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

On 11-Jun-99 Goran Thyni wrote:

Tatsuo Ishii wrote:

I think that until NATIONAL CHARACTER is fully implemented, we will not be
able to properly handle sort orders with multi-byte characters.
(Thomas, I've been thinking about implementing NCHAR too!)

So you could do whatever you like as long as it is inside #ifndef
MULTIBYTE :-)

Hmm,
and the world is moving towards Unicode.
We definitely need working support for locales and multibyte chars.
Perhaps the right way to do it is to store everything in
some Unicode format internally and convert the output
according to the suggested "per-column locale definition".

Larry Wall has hacked UTF8 support into Perl, so it is doable;
let's see what we can do for 6.6 (time permitting, as always).

IMHO, it would also be pleasant to add charset/unicode per database or,
possibly, per single table at runtime, not for the whole postgres via configure.

---
Dmitry Samersoff, dms@wplus.net, ICQ:3161705
http://devnull.wplus.net
* There will come soft rains ...

#33Dmitry Samersoff
dms@wplus.net
In reply to: Oleg Broytmann (#29)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

On 11-Jun-99 Oleg Broytmann wrote:

On Thu, 10 Jun 1999, Thomas Lockhart wrote:

istm that the Russian and Japanese contingents could represent the
needs of multibyte and locale concerns very well. So, we should ask
ourselves some questions to see if we can make *progress* in evolving
our text handling, rather than just staying the same forever.

Ok, we are here.
And what are the pros and cons of NCHAR?

SQL92 suggests some specific text handling features to help with
non-ascii applications. "national character" is, afaik, the feature

What is the help?

which would hold an installation-wide local text type. "collations"
would allow other text types in the same installation, but SQL92 is a
bit fuzzier about how to make them work.

Would these mechanisms work for people? Or are they so fundamentally
flawed or non-standard (it is from a standard, but I'm not sure who
implements it)?

There are two different problems under the "locale" cover.

The first is national charset handling, mostly sorting;
I am trying to find an official recommendation for it.

The other one is encoding conversion -
an unlimited field for discussion.

IMHO, the conversion table has to be stored on a per-host basis,
somewhere near the HBA.

The sort table is to be placed near createdb.
A custom sort table is also very attractive to me.

---
Dmitry Samersoff, dms@wplus.net, ICQ:3161705
http://devnull.wplus.net
* There will come soft rains ...

#34Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Dmitry Samersoff (#33)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Hi!

On Fri, 11 Jun 1999, Dmitry Samersoff wrote:

IMHO, the conversion table has to be stored on a per-host basis,
somewhere near the HBA.

Currently you can do this with per-host (SQL command) SET
CLIENT_ENCODING or (env var) export PGCLIENTENCODING=...

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#35Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#29)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

istm that the Russian and Japanese contingents could represent the
needs of multibyte and locale concerns very well. So, we should ask
ourselves some questions to see if we can make *progress* in evolving
our text handling, rather than just staying the same forever.

Ok, we are here.
And what are the pros and cons of NCHAR?

I was hoping you would tell me! :)

SQL92 suggests some specific text handling features to help with
non-ascii applications.

What is the help?

OK, SQL92 defines two kinds of native character sets: those we already
have (char, varchar) and those which can be locale-customized (nchar,
national character varying, and others). char and varchar always
default to the "SQL" behavior (which I think corresponds to ascii
(called "SQL_TEXT"), but I didn't bother looking up the details).

So, at its simplest, there would be two sets of character types, with
char, varchar, etc, always the same on every system (just like
Postgres w/o multibyte or locale), and nchar, nvarchar, etc configured
as your locale/multibyte environment would want.

However, there are many more features in SQL92 to deal with text
customization. I'll mention a few (well, most of them, but not in
detail):

o You can define a "character set" and, independently, a "collation".
The syntax for the type definition is
CHARACTER [ VARYING ] [ (length) ]
[ CHARACTER SET your-character-set ]
[ COLLATE your-collation-sequence ]

o You can specify a character type for string literals:
_your-character-set 'literal string' e.g. _ESPANOL 'Que pasa?'
(forgive my omission of a leading upside down question mark :)
We already have some support for this in that character string
literals can have a type specification (e.g. "DATETIME 'now'") and
presumably we can use the required underscore to convert the
"_ESPANOL" to a character set and collation, all within the existing
Postgres type system.

o You can specify collation behavior in a strange way:
'Yo dude!' COLLATE collation-method
(which we could convert in the parser to a function call).

o You can translate between character sets, *if* there is a reasonable
mapping available:
TRANSLATE(string USING method)
and you can define translations in a vague way (since no one but
Postgres implemented a type system back then):
CREATE TRANSLATION translation
FOR source-charset
TO target-charset
FROM { EXTERNAL('external-translation') | IDENTITY |
existing-translation }
DROP TRANSLATION translation

o You can convert character strings which have the same character
"repertoire" from one to the other:
CONVERT(string USING conversion-method)
(e.g. we could define a method "EBCDIC_TO_ASCII" once we have an
"EBCDIC" character set).

o You can specify identifiers (column names, etc) with a specific
character set/collation by:
_charset colname (e.g. _FRANCAIS Francais where the second "c" is
allowed to be "c-cedilla", a character in the French/latin character
set; sorry I didn't type it).

Would these mechanisms work for people? Or are they so fundamentally
flawed or non-standard (it is from a standard, but I'm not sure who
implements it)?

Fully implementing these features (or a reasonable subset) would give
us more capabilities than we have now, and imho can be fit into our
existing type system. *Only* implementing NCHAR etc gives us the
ability to carry SQL_TEXT and multibyte/locale types in the same
database, which may not be a huge benefit to those who never want to
mix them in the same installation. I don't know who those folks might
be but Tatsuo and yourself probably do.

Comments?

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#36Oleg Broytmann
phd@emerald.netskate.ru
In reply to: Thomas Lockhart (#35)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Hi!

On Fri, 11 Jun 1999, Thomas Lockhart wrote:

And what a pros and cons for NCHAR?

I was hoping you would tell me! :)

I can see only one advantage for NCHAR - those fields that aren't NCHAR
will not use strcoll() for comparison.
But I cannot remember one field in my database that does not contain
Russian characters. Even my WWW logs contain them.
So in any case I am forced to make all my fields NCHAR, and this is
exactly what we have now - postgres compiled with --enable-locale makes all
char NCHAR.

- Thomas

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#37Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#36)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

I can see only one advantage for NCHAR - those fields that aren't NCHAR
will not use strcoll() for comparison.
But I cannot remember one field in my database that does not contain
Russian characters. Even my WWW logs contain them.
So in any case I am forced to make all my fields NCHAR, and this is
exactly what we have now - postgres compiled with --enable-locale makes
all char NCHAR.

Yes, and that is how we got the implementation we have. Implementing
NCHAR is a step on the road toward having fully flexible character set
capabilities in a single database. By itself, NCHAR probably does not
offer tremendous advantages for anyone running a fully "localized"
database.

So some of the questions really might be:
1) Is implementing NCHAR, and reverting CHAR back to the SQL-standard
ascii-ish behavior, acceptable, or does it introduce fatal flaws for
implementers? e.g. do any third party tools know about NCHAR? I would
assume that the odbc interface could just map NCHAR to CHAR if odbc
knows nothing about NCHAR...

2) Solving various problems for specific datasets will require new
specialized support routines. If this is true, then isn't the Postgres
type system the way to introduce these specialized capabilities?
Doesn't Unicode, for example, work well as a new data type, as opposed
to shoehorning it into all areas of the backend with #ifdefs?

3) Do the SQL92-defined features help us solve the problem, or do they
just get in the way? istm that they address some of the features we
would need, and have sufficient fuzz around the edges to allow a
successful implementation.

An example of what we could do would be to have both Russian/Cyrillic
and Japanese regression tests in the main regression suite, since they
could coexist with the other tests.

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#38Hannu Krosing
hannu@trust.ee
In reply to: Oleg Broytmann (#36)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Oleg Broytmann wrote:

Hi!

On Fri, 11 Jun 1999, Thomas Lockhart wrote:

And what a pros and cons for NCHAR?

I was hoping you would tell me! :)

I can see only one advantage for NCHAR - those fields that aren't NCHAR
will not use strcoll() for comparison.
But I cannot remember one field in my database that does not contain
Russian characters. Even my WWW logs contain them.

what about the tables beginning with pg_ ?

Are the system tables currently affected by --enable-locale ?

So in any case I am forced to make all my fields NCHAR, and this is
exactly what we have now - postgres compiled with --enable-locale makes all
char NCHAR.

Well, the problem is that while I do occasionally need cyrillic chars,
I also need English, Estonian, Finnish/Swedish, Latvian and Lithuanian.

The only two of them that don't have overlapping character codes are
Russian (all chars > 127) and English (all < 128).

My current solution is to run without --enable-locale and do all the
sorting in the client. But it would often be useful to have
language-specific columns.

--------------------
Hannu

#39Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#36)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Well, the problem is that while I do occasionally need Cyrillic chars,
I also need English, Estonian, Finnish/Swedish, Latvian and Lithuanian.
The only two of them that don't have overlapping character codes are
Russian (all chars >127) and English (all < 128).
My current solution is to run without --enable-locale and do all the
sorting in the client. But it would often be useful to have
language-specific columns.

Great! You're our representative for multi-charset usage :)

Perhaps you can think about and comment on the SQL92 features which
support this, and whether that kind of capability would suit your
needs. Of course, we would need to start getting specific about how to
map those generic features into a Postgres implementation...
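
To make that concrete, the SQL92 features I have in mind would allow
something like the following (a sketch only; the character set and
collation names are illustrative, and Postgres implements none of this
today):

CREATE TABLE person (
    name_ru CHARACTER VARYING(40) CHARACTER SET cyrillic,
    name_et CHARACTER VARYING(40) CHARACTER SET latin7
);
SELECT name_et FROM person ORDER BY name_et COLLATE estonian;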

btw, istm that we could make things somewhat backward compatible by
allowing the backend to be built with CHAR and NCHAR always mapped to
NCHAR.

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#40Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#39)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

OK, SQL92 defines two kinds of native character sets: those we already
have (char, varchar) and those which can be locale customized (nchar,
national character varying, and others). char and varchar always
default to the "SQL" behavior (which I think corresponds to ascii
(called "SQL_TEXT") but I didn't bother looking for the details).

This seems to be a little bit different from the standard. First,
SQL_TEXT is not equal to ascii. It's a subset of ascii. Second, the
default charset for char and varchar might be implementation-dependent,
not necessarily limited to SQL_TEXT. The only requirement is that the
charset must contain the repertoire SQL_TEXT has. I think any charset
I've ever seen, ascii included, satisfies that requirement. Third, the
standard says nothing about locale.
---
Tatsuo Ishii

#41Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#40)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Well, the problem is that while I do occasionally need cyrillic chars,
I also need English, Estonian, Finnish/Swedish, Latvian and Lithuanian.

This is probably the first example to appear on this list of demand
for an i18n database, which in my opinion should be the future
direction of PostgreSQL.

Currently MB has two encodings for this kind of purpose: Unicode and
mule-internal-code. Both of them allow mixed languages even in the
same column. This might give you a partial solution.
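
A minimal example of what MB already offers (assuming a backend built
with --enable-mb; the createdb flag and the particular encoding pair
are taken from the MB documentation as I remember it, so treat them as
assumptions):

-- database created with a multibyte server encoding, e.g.:
--   createdb -E MULE_INTERNAL mixeddb
SET CLIENT_ENCODING TO 'KOI8';  -- client side talks KOI8-R
SELECT * FROM some_table;       -- text converted on the way out
RESET CLIENT_ENCODING;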

The only two of them that don't have overlapping character codes are
Russian (all chars >127) and English (all < 128)

My current solution is to run without --enable-locale and do all the
sorting in the client. But it would often be useful to have
language-specific columns.

This is another reason why we cannot rely on the locale mechanism
supported by the OS. Basically, locale is global state for the entire
process. If each column had a different language, we would have to
switch locales for every comparison. That would be inefficient and
painful.
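
To see why, imagine per-column character sets in SQL92 style (purely
hypothetical; nothing like this exists in Postgres today):

CREATE TABLE t (
    ru CHARACTER VARYING(20) CHARACTER SET cyrillic, -- wants ru_RU
    fi CHARACTER VARYING(20) CHARACTER SET latin1    -- wants fi_FI
);
SELECT * FROM t ORDER BY ru, fi;
-- With the OS locale mechanism, every comparison inside this single
-- sort would need a setlocale(LC_COLLATE, ...) call to flip the
-- process-wide locale between ru_RU and fi_FI before strcoll().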
---
Tatsuo Ishii

#42Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tatsuo Ishii (#40)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

This seems to be a little bit different from the standard. First,
SQL_TEXT is not equal to ascii. It's a subset of ascii.

Yes, sorry. I was lazy in my posting.

Second, the
default charset for char and varchar might be implementation-dependent,
not necessarily limited to SQL_TEXT. The only requirement is that the
charset must contain the repertoire SQL_TEXT has. I think any charset
I've ever seen, ascii included, satisfies that requirement.

Yow! I certainly misremembered the definition. Date and Darwen, 1997,
point out that the SQL implementation *must* support at least one
character set, SQL_TEXT, whose repertoire must contain:

1) Every character that is used in the SQL language itself (this is
the part I remembered), and

2) Every character that is included in *any other character set*
supported by the SQL implementation (Postgres).

This second requirement is presumably to enable text handling of
multiple character sets, but would seem to put severe restrictions on
how we would implement things. Or can it act only as a placeholder,
allowing us to define new character sets as different types in
Postgres? Otherwise, we would have to retrofit capabilities into
SQL_TEXT anytime we defined a new character set??

Third, the
standard says nothing about locale.

You are referring to the Unix-style system support for "locale"?
Certainly the NCHAR and character set support in SQL92 would qualify
as locale support in the generic sense...

Regards.

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#43Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#42)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Second, the
default charset for char and varchar might be implementation-dependent,
not necessarily limited to SQL_TEXT. The only requirement is that the
charset must contain the repertoire SQL_TEXT has. I think any charset
I've ever seen, ascii included, satisfies that requirement.

Yow! I certainly misremembered the definition. Date and Darwen, 1997,
point out that the SQL implementation *must* support at least one
character set, SQL_TEXT, whose repertoire must contain:

1) Every character that is used in the SQL language itself (this is
the part I remembered), and

2) Every character that is included in *any other character set*
supported by the SQL implementation (Postgres).

This second requirement is presumably to enable text handling of
multiple character sets, but would seem to put severe restrictions on
how we would implement things. Or can it act only as a placeholder,
allowing us to define new character sets as different types in
Postgres? Otherwise, we would have to retrofit capabilities into
SQL_TEXT anytime we defined a new character set??

I don't think so. 2) can be read as:

Any other character set must contain every character included in
SQL_TEXT.

This seems extremely easy to implement. We could define SQL_TEXT to be
a subset of ascii, and almost any character set contains the ascii
characters. As a result, any character set satisfies the above, which
is logically the same as 2). No?

Third, the
standards says nothing about locale.

You are referring to the Unix-style system support for "locale"?

Yes.

Certainly the NCHAR and character set support in SQL92 would qualify
as locale support in the generic sense...

---
Tatsuo Ishii

#44Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tatsuo Ishii (#43)
Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

SQL_TEXT ... repertoire must contain:
1) Every character that is used in the SQL language itself (this is
the part I remembered), and
2) Every character that is included in *any other character set*
supported by the SQL implementation (Postgres).
This second requirement is presumably to enable text handling of
multiple character sets, but would seem to put severe restrictions on
how we would implement things. Or can it act only as a placeholder,
allowing us to define new character sets as different types in
Postgres? Otherwise, we would have to retrofit capabilities into
SQL_TEXT anytime we defined a new character set??

I don't think so. 2) can be read as:
Any other character set must contain every character included in
SQL_TEXT.

Here is the text from the July, 1992 SQL92 draft standard:

    The <implementation-defined character repertoire name> SQL_TEXT
    specifies the name of a character repertoire and implied
    form-of-use that can represent every character that is in
    <SQL language character> and all other characters that are in
    character sets supported by the implementation.

and later in the same doc:

    11) The character set named SQL_TEXT is an implementation-defined
    character set whose character repertoire is SQL_TEXT.

I'm reading this to say that SQL_TEXT must contain the union of all
characters in the character sets in the implementation, not just the
characters required by the SQL language itself.
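
In other words (my notation, not the standard's): the draft seems to
require that repertoire(SQL_TEXT) contain the union of repertoire(CS)
over every supported character set CS, whereas the earlier reading
requires each repertoire(CS) to contain repertoire(SQL_TEXT). The two
readings coincide only if all supported character sets have the same
repertoire.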

But I'm not really sure what they mean by this, or whether it is a
problem or not. Clearly different character sets and collations can be
mixed only when that can preserve meaning, so saying that SQL_TEXT has
a repertoire which contains ASCII characters and Japanese characters
doesn't seem to help much.

So istm that "SQL_TEXT" might be just a container class for all
characters in the installation, which still doesn't make complete
sense to me wrt a Postgres implementation.

- Tom

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California