spanish locale question

Started by jbiskofskialmost 14 years ago9 messagesgeneral
Jump to latest
#1jbiskofski
jbiskofski@gmail.com

Hello everyone,

I have a lc_collate problem. Im in Mexico and I need the following three
lastnames to be sorted this way :

álvarez ( accent on first a )
chavez
cota

Using the default locale on my mac ( en_US ) I end up with :

chavez
cota
álvarez

So I switched to es_ES.ISO8859-15 and that gives me :

álvarez
cota
chavez

... There was a time when the "Real Academia Española" considered "CH",
"LL" and "SH" as letters. They changed that in 1994 :

I*n 1994, the RAE ruled that the Spanish consonants **"CH" (ché) and "LL" (
elle) would hence be alphabetized under "C" and under "L", respectively,
and not as separate, discrete letters, as in the past. The RAE eliminated
monosyllabic accented vowels where the accent did not serve in changing the
word's meaning, examples include: "dio" ("gave"), "vio" ("saw"), both had
an acutely-accented vowel "ó"; yet the monosyllabic word "sé" ("I know",
the first person, singular, present of "saber", "to know"; and the singular
imperative of "ser", "to be") retains its acutely-accented vowel in order
to differentiate it from the reflexive pronoun "se".*
*
*
*http://en.wikipedia.org/wiki/Real_Academia_Espa%C3%B1ola*
*
*
*
*
I think thats where the problem comes from.

Anyway, any hints/clues/suicide-method-suggestions would be greatly
appreciated!

Thanks!

- Jose Biskofski
www.algebraix.com

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: jbiskofski (#1)
Re: spanish locale question

jbiskofski wrote:

I have a lc_collate problem. Im in Mexico and I need the following three lastnames to be sorted this
way :

álvarez ( accent on first a )
chavez
cota

Using the default locale on my mac ( en_US ) I end up with :

chavez
cota
álvarez

So I switched to es_ES.ISO8859-15 and that gives me :

álvarez
cota
chavez

... There was a time when the "Real Academia Española" considered "CH", "LL" and "SH" as letters. They
changed that in 1994 :

In 1994, the RAE ruled that the Spanish consonants "CH" (ché) and "LL" (elle) would hence be
alphabetized under "C" and under "L", respectively, and not as separate, discrete letters, as in the
past. The RAE eliminated monosyllabic accented vowels where the accent did not serve in changing the
word's meaning, examples include: "dio" ("gave"), "vio" ("saw"), both had an acutely-accented vowel
"ó"; yet the monosyllabic word "sé" ("I know", the first person, singular, present of "saber", "to
know"; and the singular imperative of "ser", "to be") retains its acutely-accented vowel in order to
differentiate it from the reflexive pronoun "se".

http://en.wikipedia.org/wiki/Real_Academia_Espa%C3%B1ola

I think thats where the problem comes from.

Anyway, any hints/clues/suicide-method-suggestions would be greatly appreciated!

PostgreSQL uses the operating system's collations.
Ask your operating system provider.

On my RHEL 3 Linux system it works as you want it to:

CREATE TABLE mexico(id integer PRIMARY KEY, val text NOT NULL COLLATE "es_ES.utf8");

INSERT INTO mexico VALUES (1, 'cota'), (2, 'álvarez'), (3, 'chavez');

SELECT * FROM mexico ORDER BY val;

id | val
----+---------
2 | álvarez
3 | chavez
1 | cota
(3 rows)

Yours,
Laurenz Albe

#3jbiskofski
jbiskofski@gmail.com
In reply to: Laurenz Albe (#2)
Re: spanish locale question

Laurenz, thank you so much for your help. I had a hard time getting the
es_MX locale installed on my freebsd system, but once I did it worked
perfectly. Thanks for your help again.

- Jose

On Thu, Apr 19, 2012 at 2:30 AM, Albe Laurenz <laurenz.albe@wien.gv.at>wrote:

Show quoted text

jbiskofski wrote:

I have a lc_collate problem. Im in Mexico and I need the following three

lastnames to be sorted this

way :

álvarez ( accent on first a )
chavez
cota

Using the default locale on my mac ( en_US ) I end up with :

chavez
cota
álvarez

So I switched to es_ES.ISO8859-15 and that gives me :

álvarez
cota
chavez

... There was a time when the "Real Academia Española" considered "CH",

"LL" and "SH" as letters. They

changed that in 1994 :

In 1994, the RAE ruled that the Spanish consonants "CH" (ché) and "LL"

(elle) would hence be

alphabetized under "C" and under "L", respectively, and not as separate,

discrete letters, as in the

past. The RAE eliminated monosyllabic accented vowels where the accent

did not serve in changing the

word's meaning, examples include: "dio" ("gave"), "vio" ("saw"), both

had an acutely-accented vowel

"ó"; yet the monosyllabic word "sé" ("I know", the first person,

singular, present of "saber", "to

know"; and the singular imperative of "ser", "to be") retains its

acutely-accented vowel in order to

differentiate it from the reflexive pronoun "se".

http://en.wikipedia.org/wiki/Real_Academia_Espa%C3%B1ola

I think thats where the problem comes from.

Anyway, any hints/clues/suicide-method-suggestions would be greatly

appreciated!

PostgreSQL uses the operating system's collations.
Ask your operating system provider.

On my RHEL 3 Linux system it works as you want it to:

CREATE TABLE mexico(id integer PRIMARY KEY, val text NOT NULL COLLATE
"es_ES.utf8");

INSERT INTO mexico VALUES (1, 'cota'), (2, 'álvarez'), (3, 'chavez');

SELECT * FROM mexico ORDER BY val;

id | val
----+---------
2 | álvarez
3 | chavez
1 | cota
(3 rows)

Yours,
Laurenz Albe

#4Tulio
tulio.carrasco@gmail.com
In reply to: jbiskofski (#3)
Re: spanish locale question

Hi there,

Let me expand the collate situation. I´m from Perú and I have turned
everything in postgresql.conf as 'es_PE.UTF-8' even the
default_text_search_config = 'pg_catalog.spanish'. Even my Ubuntu 12.04
works in English I have es_PE locale too.

In Spanish vowels as "u" is equal as "ú" or even "ü". So let me build an
example:

CREATE TABLE pru(id integer PRIMARY KEY, dad text, mum text, name text);

INSERT INTO pru VALUES (1, 'león','valencia', 'josé'), (2, 'leon',
'mendoza', 'juan'), (3, 'león', 'valárd', 'jose'), (4, 'león','válencia',
'jos'), (5, 'león', 'mendoza', 'jua'), (6, 'leon', 'valencia', 'josie'), (7,
'león', 'valencia', 'josie');

if I do

SELECT * FROM pru order by dad,mum,name;

I get:

id | dad | mum | name
----+------+----------+-------
2 | leon | mendoza | juan
6 | leon | valencia | josie
5 | león | mendoza | jua
3 | león | valárd | jose
1 | león | valencia | josé
7 | león | valencia | josie
4 | león | válencia | jos
(7 rows)

Which is a wrong order (collation) in Spanish and I don´t understand why.
But, I noticed that if I do:

SELECT * FROM pru order by dad || mum || name;

I get the correct order:

id | dad | mum | name
----+------+----------+-------
5 | león | mendoza | jua
2 | leon | mendoza | juan
3 | león | valárd | jose
4 | león | válencia | jos
1 | león | valencia | josé
6 | leon | valencia | josie
7 | león | valencia | josie
(7 rows)

Is this the correct way to order in Postgresql and if it´s not Does anyone
have an idea and could please explain it to me?
Thanks in advance.

Tulio

--
View this message in context: http://postgresql.1045698.n5.nabble.com/spanish-locale-question-tp5650043p5681819.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tulio (#4)
Re: spanish locale question

Tulio wrote:

Let me expand the collate situation. I´m from Perú and I have turned
everything in postgresql.conf as 'es_PE.UTF-8' even the
default_text_search_config = 'pg_catalog.spanish'. Even my Ubuntu 12.04
works in English I have es_PE locale too.

if I do

SELECT * FROM pru order by dad,mum,name;

I get:

id | dad | mum | name
----+------+----------+-------
2 | leon | mendoza | juan
6 | leon | valencia | josie
5 | león | mendoza | jua
3 | león | valárd | jose
1 | león | valencia | josé
7 | león | valencia | josie
4 | león | válencia | jos
(7 rows)

Which is a wrong order (collation) in Spanish and I don´t understand why.

Maybe you misunderstood what it means to ORDER BY multiple
columns. In your query, the rows are ordered by "dad",
then all rows where "dad" is the same are ordered by "mum",
and finally all rows where "dad" and "mum" are the same
are ordered by "name".

It is explained in the documentation:
http://www.postgresql.org/docs/current/static/queries-order.html

"When more than one expression is specified, the later values
are used to sort rows that are equal according to the earlier values."

But, I noticed that if I do:

SELECT * FROM pru order by dad || mum || name;

I get the correct order:

id | dad | mum | name
----+------+----------+-------
5 | león | mendoza | jua
2 | leon | mendoza | juan
3 | león | valárd | jose
4 | león | válencia | jos
1 | león | valencia | josé
6 | leon | valencia | josie
7 | león | valencia | josie
(7 rows)

Is this the correct way to order in Postgresql and if it´s not Does anyone
have an idea and could please explain it to me?

This is not PostgreSQL-specific behaviour, it is defined in the
SQL standard and works like this on all database systems I know.

You can use the ORDER BY clause you propose if you prefer this
ordering.

But would you really order 'leon', 'mendoza', 'juan' before
'leo', 'zara', 'juan'?

Yours,
Laurenz Albe

#6Al Eridani
al.eridani@gmail.com
In reply to: Laurenz Albe (#5)
Re: spanish locale question

On Fri, May 4, 2012 at 1:22 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

But would you really order 'leon', 'mendoza', 'juan' before
'leo', 'zara', 'juan'?

No, I guess that Tulio would not do that, because (usually) 'leo' goes
before 'leon'.

What Tulio is saying is that 'leon' and 'león' are the same thing from
the point of view of sorting in Spanish, but his PostgreSQL seems to
think that 'leon' goes before 'león'.

Al

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Al Eridani (#6)
Re: spanish locale question

Al Eridani <al.eridani@gmail.com> writes:

What Tulio is saying is that 'leon' and 'le�n' are the same thing from
the point of view of sorting in Spanish, but his PostgreSQL seems to
think that 'leon' goes before 'le�n'.

Postgres never considers that two distinct strings are "equal". If the
locale setting considers these equal (which isn't entirely clear from
the given evidence), PG would then sort them on the basis of their
character code values.

A possible workaround if you need to consider them equal is to strip the
accents before sorting (ie, something like "ORDER BY to_ascii(col)") but
this may well throw away more information than you want ...

regards, tom lane

#8Andreas Joseph Krogh
andreak@officenet.no
In reply to: Tom Lane (#7)
Re: spanish locale question

On 05/04/2012 07:31 PM, Tom Lane wrote:

Al Eridani<al.eridani@gmail.com> writes:

What Tulio is saying is that 'leon' and 'le�n' are the same thing from
the point of view of sorting in Spanish, but his PostgreSQL seems to
think that 'leon' goes before 'le�n'.

Postgres never considers that two distinct strings are "equal". If the
locale setting considers these equal (which isn't entirely clear from
the given evidence), PG would then sort them on the basis of their
character code values.

A possible workaround if you need to consider them equal is to strip the
accents before sorting (ie, something like "ORDER BY to_ascii(col)") but
this may well throw away more information than you want ...

Note that to_ascii barfs on unicode-input:

ERROR: encoding conversion from UTF8 to ASCII not supported

Better install unaccent:

cd ./postgresql-9.1.2/contrib/unaccent
make install
psql
CREATE EXTENSION unaccent;
andreak=# select unaccent('le�n');
unaccent
----------
leon
(1 row)

--
Andreas Joseph Krogh<andreak@officenet.no> - mob: +47 909 56 963
Senior Software Developer / CEO - OfficeNet AS - http://www.officenet.no
Public key: http://home.officenet.no/~andreak/public_key.asc

#9Tulio
tulio.carrasco@gmail.com
In reply to: Laurenz Albe (#5)
Re: spanish locale question

Thank you Laurenz for your answer.

I think I understand the meaning of ORDER BY my problem is the
understanding of COLLATE functionality. I thought that if I collate in es.PE
(Spanish) the "o" has the exact value of "ó" so if I

INSERT INTO pru VALUES (8, 'leo','zara', 'juan');

How can I obtain this order:

id | dad | mum | name
----+------+----------+-------
8 | leo | zara | juan
5 | león | mendoza | jua
2 | leon | mendoza | juan
3 | león | valárd | jose
4 | león | válencia | jos
1 | león | valencia | josé
6 | leon | valencia | josie
7 | león | valencia | josie

In the understanding (please forgive I repeat) that "o" and "ó" has no
difference in weight in the Spanish collation. As you can see below in the
first table of your answer the collation doesn´t work. It assumes that leon
is different from león (with accent). I was assuming the collation
eliminates such difference. It isn´t that way it works?

Thanks in advance for your help.

Tulio

Albe Laurenz *EXTERN* wrote

Tulio wrote:

Let me expand the collate situation. I´m from Perú and I have turned
everything in postgresql.conf as 'es_PE.UTF-8' even the
default_text_search_config = 'pg_catalog.spanish'. Even my Ubuntu 12.04
works in English I have es_PE locale too.

if I do

SELECT * FROM pru order by dad,mum,name;

I get:

id | dad | mum | name
----+------+----------+-------
2 | leon | mendoza | juan
6 | leon | valencia | josie
5 | león | mendoza | jua
3 | león | valárd | jose
1 | león | valencia | josé
7 | león | valencia | josie
4 | león | válencia | jos
(7 rows)

Which is a wrong order (collation) in Spanish and I don´t understand why.

Maybe you misunderstood what it means to ORDER BY multiple
columns. In your query, the rows are ordered by "dad",
then all rows where "dad" is the same are ordered by "mum",
and finally all rows where "dad" and "mum" are the same
are ordered by "name".

It is explained in the documentation:
http://www.postgresql.org/docs/current/static/queries-order.html

"When more than one expression is specified, the later values
are used to sort rows that are equal according to the earlier values."

But, I noticed that if I do:

SELECT * FROM pru order by dad || mum || name;

I get the correct order:

id | dad | mum | name
----+------+----------+-------
5 | león | mendoza | jua
2 | leon | mendoza | juan
3 | león | valárd | jose
4 | león | válencia | jos
1 | león | valencia | josé
6 | leon | valencia | josie
7 | león | valencia | josie
(7 rows)

Is this the correct way to order in Postgresql and if it´s not Does
anyone
have an idea and could please explain it to me?

This is not PostgreSQL-specific behaviour, it is defined in the
SQL standard and works like this on all database systems I know.

You can use the ORDER BY clause you propose if you prefer this
ordering.

But would you really order 'leon', 'mendoza', 'juan' before
'leo', 'zara', 'juan'?

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
View this message in context: http://postgresql.1045698.n5.nabble.com/spanish-locale-question-tp5650043p5687242.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.