LC_COLLATE=es_MX in PgSQL 7.3.2

Started by Octavio Alvarezalmost 23 years ago7 messagesgeneral
Jump to latest
#1Octavio Alvarez
alvarezp@octavio.ods.org

Ohh! Look at this... I think query #4 does not handle the accents
correctly, and it should behave almost like #5 does, except for the record
with p = 'R' which should be at the top (if they were #4's results).

#5 is only shown as a sample. #5 works correctly.

I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was initialized
with --lc-collate=es_MX.

1. This is the table:
testdb=# \d t
Table "public.t"
Column | Type | Modifiers
--------+-----------------------+-----------
p | character varying(20) |
m | character varying(20) |

2. This is the data:
testdb=# SELECT * FROM t;
p | m
----+------
ra | ab
re | cc
ri | cd
r� | gg
r� | zz
RE | t3
R | t�3
R� | t3
R� | t4
R� | am4
R� | am 5
(11 rows)

3. I think the following is correct:
testdb=# SELECT * FROM t ORDER BY p;
p | m
----+------
R | t�3
ra | ab
re | cc
RE | t3
r� | gg
r� | zz
R� | t3
R� | t4
ri | cd
R� | am 5
R� | am4
(11 rows)

4. The following is NOT correct. Take a look at #5 for the almost-correct
behavior.
testdb=# SELECT * FROM t ORDER BY p, m;
p | m
----+------
R | t�3
ra | ab
re | cc
RE | t3
r� | gg
r� | zz
R� | t3
R� | t4
ri | cd
R� | am 5
R� | am4
(11 rows)

5. The following shows the result somewhat like they should be. Actually,
the "R" alone should be at the beginning (not in this query, but if these
were the results of the previous one. This one is working ok.)
testdb=# SELECT * FROM t ORDER BY p || m;
p | m
----+------
ra | ab
re | cc
r� | gg
RE | t3
R� | t3
R� | t4
r� | zz
R� | am4
R� | am 5
ri | cd
R | t�3
(11 rows)

--
Octavio Alvarez Piza.
E-mail: alvarezp@octavio.ods.org

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Octavio Alvarez (#1)
Re: LC_COLLATE=es_MX in PgSQL 7.3.2

"Octavio Alvarez" <alvarezp@octavio.ods.org> writes:

Ohh! Look at this... I think query #4 does not handle the accents
correctly, and it should behave almost like #5 does, except for the record
with p = 'R' which should be at the top (if they were #4's results).
I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was initialized
with --lc-collate=es_MX.

How about --lc-ctype? I think that accent handling would be driven by
LC_CTYPE not LC_COLLATE. In any case, this is not a Postgres bug unless
you can show that other programs using the same LC_foo settings behave
differently. We punt pretty much all locale-related processing to
subroutines in libc.

regards, tom lane

#3Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Octavio Alvarez (#1)
Re: LC_COLLATE=es_MX in PgSQL 7.3.2

On Wed, Jun 11, 2003 at 04:40:42PM -0700, Octavio Alvarez wrote:

Ohh! Look at this... I think query #4 does not handle the accents
correctly, and it should behave almost like #5 does, except for the record
with p = 'R' which should be at the top (if they were #4's results).

I don't understand. Are you saying that "r�" should sort the same as
"R�", and that the ambiguity should be solved by the next column? Well
if it's that, I'm afraid it's not the way the locale is defined. I wish
I could tell you how it is defined, but I don't really understand it.
You can see for yourself in:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/localedata/locales/es_ES?cvsroot=glibc

The rules seem pretty convoluted.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"El destino baraja y nosotros jugamos" (A. Schopenhauer)

#4Octavio Alvarez
alvarezp@octavio.ods.org
In reply to: Alvaro Herrera (#3)
Re: LC_COLLATE=es_MX in PgSQL 7.3.2

Tom Lane said:

I'm using PGSQL 7.3.2 under Redhat Linux 8.0. The database was
initialized
with --lc-collate=es_MX.

How about --lc-ctype? I think that accent handling would be driven by
LC_CTYPE not LC_COLLATE.

May be it's not the accents after all. I did the following tests without
accents.

Okay. Now, I tried several combinations, including --locale=es_MX and
--lc-collate=es_MX --lc-ctype=es_MX, and got the same result.

I would like to point out something: (still PG 7.3.2)

I tried the following with --locale=es_MX, with --locale=en_US, with
--locale=en_US.UTF-8.

alvarezp=# select * from t order by p asc, m asc;
p | m
-------+-------
octav | alvar
OCTAV | ALVAA
OCTAV | ALVAZ
octia | alvra
OCTIa | ALVAa
OCTIb | ALVZa
OCTIb | ALVZa
octic | alvra
OCTIc | ALVAa
octvi | alvra
OCTVI | ALVAa
OCTVI | ALVZa
(12 rows)

No accents here. I would have expected:
p | m
-------+-------
OCTAV | ALVAA
octav | alvar
OCTAV | ALVAZ
OCTIa | ALVAa
octia | alvra
OCTIb | ALVZa
OCTIb | ALVZa
OCTIc | ALVAa
octic | alvra
OCTVI | ALVAa
octvi | alvra
OCTVI | ALVZa
(12 rows)

--locale=C gives out
p | m
-------+-------
OCTAV | ALVAA
OCTAV | ALVAZ
OCTIa | ALVAa
OCTIb | ALVZa
OCTIb | ALVZa
OCTIc | ALVAa
OCTVI | ALVAa
OCTVI | ALVZa
octav | alvar
octia | alvra
octic | alvra
octvi | alvra
(12 rows)

which I thnk is correct for that locale. Well, whatever.

In any case, this is not a Postgres bug unless
you can show that other programs using the same LC_foo settings behave
differently. We punt pretty much all locale-related processing to
subroutines in libc.

How could I test that? I tried the following. Notice how the "octav"
values are correctly sorted, but I don't know if SORT is actually
separating the fields or understanding the whole line as 1 key.

[alvarezp@pgsql alvarezp]$ sort -t : < o
OCTAV:ALVAA
octav:alvar
OCTAV:ALVAZ
OCTIa:ALVAa
octia:alvra
OCTIb:ALVZa
OCTIb:ALVZa
OCTIc:ALVAa
octic:alvra
OCTVI:ALVAa
octvi:alvra
OCTVI:ALVZa

Whatever. Take a look at this one:

[alvarezp@pgsql alvarezp]$ sort -k 1,1 < o
octav alvar
OCTAV ALVAA
OCTAV ALVAZ
octia alvra
OCTIa ALVAa
OCTIb ALVZa
OCTIb ALVZa
octic alvra
OCTIc ALVAa
octvi alvra
OCTVI ALVAa
OCTVI ALVZa

I don't know if detection of which keys are equal (in this case
octav=OCTAV=OCTAV) should be made by PostgreSQL or libc. I also don't know
if I am wrong assuming octav=OCTAV. For alphabetic sorting, it should be
case insensitive.

Octavio.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Octavio Alvarez (#4)
Re: LC_COLLATE=es_MX in PgSQL 7.3.2

"Octavio Alvarez" <alvarezp@octavio.ods.org> writes:

I don't know if detection of which keys are equal (in this case
octav=OCTAV=OCTAV) should be made by PostgreSQL or libc. I also don't know
if I am wrong assuming octav=OCTAV. For alphabetic sorting, it should be
case insensitive.

I believe that the usual locale definitions set things up so that
strings of different case are not "equal", but case differences are
the last tiebreaker when sorting. So "octav" sorts before "OCTAV"
but after everything else that's less than "OCTAV".

I haven't studied your results to see if they are consistent with
that viewpoint, but I'm really not ready to believe that Postgres
is sorting in a way that's not what the active locale says to do.
We have not heard such a report before. (We have, on the other
hand, heard from lots of people who were surprised at the properties
of their default locale ...)

regards, tom lane

#6Octavio Alvarez
alvarezp@octavio.ods.org
In reply to: Alvaro Herrera (#3)
Re: LC_COLLATE=es_MX in PgSQL 7.3.2

On Wednesday 11 June 2003 21:34, Alvaro Herrera wrote:

On Wed, Jun 11, 2003 at 04:40:42PM -0700, Octavio Alvarez wrote:

Ohh! Look at this... I think query #4 does not handle the accents
correctly, and it should behave almost like #5 does, except for the
record with p = 'R' which should be at the top (if they were #4's
results).

I don't understand. Are you saying that "ré" should sort the same as
"RÉ", and that the ambiguity should be solved by the next column? Well
if it's that, I'm afraid it's not the way the locale is defined.

Yes. In fact, after doing some tests, I noticed that that doesn't even work
forn non-accented characters! For example, my name, OCTAVIO is different than
octavio. As of strcoll and strcmp, they will of course be different. If I
were only to sort by name, it would be correct to consider them as the
different values, but if sorting by name and last name ("alvarez", "rivera",
for instance), "OCTAVIO rivera" should be sorted AFTER "octavio ALVAREZ"
(which is not being done like that because "OCTAVIO" > "octavio" in the
sorting of the first column).

One approach would be sorting by name || ' ' || lastname, but I don't know how
that could slow down my queries. I will try that today.

Octavio.

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Octavio Alvarez (#4)
Re: LC_COLLATE=es_MX in PgSQL 7.3.2

Octavio Alvarez writes:

No accents here. I would have expected:
p | m
-------+-------
OCTAV | ALVAA
octav | alvar
OCTAV | ALVAZ
OCTIa | ALVAa
octia | alvra
OCTIb | ALVZa
OCTIb | ALVZa
OCTIc | ALVAa
octic | alvra
OCTVI | ALVAa
octvi | alvra
OCTVI | ALVZa
(12 rows)

That's not the way it works. First it sorts by p, then by m -- there is
no cross-column magic. Maybe for your application you will be happier if
you sort by lower(p), lower(m).

--
Peter Eisentraut peter_e@gmx.net