varchar sort ordering ignore blanks

Started by Luca Arzeniover 18 years ago8 messagesgeneral
Jump to latest
#1Luca Arzeni
l.arzeni@amadego.com

Hi there,
I have a table with a single column, pk of varchar type

The table contains few names, say:

XXXX A
XXXX C
XXXXB

In the first two records there is a between the XXXX and the following letter
A and C while, the third one has a B immediately following the XXXX (without
blanks).

In postgres 7.4.7 (debian sarge), if I issue a select to sort the record I
(correctly) obtain:
XXXX A
XXXX C
XXXXB

In postgres 8.1.9 (debian etch), if I issue a select to sort the record I
(mistakenly) obtain:
XXXX A
XXXXB
XXXX C

That is: the sort order in postgres 8.1.9 seems to ignore the blank.

In all cases I'm using locale LATIN9 during DB creation, but I tested also
with ASCII, UTF8 and LATIN1 encoding.

Can someone help me to get the correct order in postgres 8.1.9 ?

=== Sample code ===

CREATE TABLE t_table
(
c_column varchar(30) NOT NULL,
CONSTRAINT t_table_pk PRIMARY KEY (c_column)
)
WITHOUT OIDS;

INSERT INTO t_table(c_column) VALUES ('XXXX A');
INSERT INTO t_table(c_column) VALUES ('XXXXB');
INSERT INTO t_table(c_column) VALUES ('XXXX C');

select * from t_table order by c_column asc;

=============

Thanks, Luca Arzeni

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Luca Arzeni (#1)
Re: varchar sort ordering ignore blanks

Luca Arzeni <l.arzeni@amadego.com> writes:

That is: the sort order in postgres 8.1.9 seems to ignore the blank.

This is expected behavior in most non-C locales.

In all cases I'm using locale LATIN9 during DB creation, but I tested also
with ASCII, UTF8 and LATIN1 encoding.

LATIN9 isn't a locale, it's an encoding. Try "initdb --locale=C".

regards, tom lane

#3Luca Arzeni
l.arzeni@gmail.com
In reply to: Tom Lane (#2)
Re: varchar sort ordering ignore blanks

On Jan 15, 2008 6:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Luca Arzeni <l.arzeni@amadego.com> writes:

That is: the sort order in postgres 8.1.9 seems to ignore the blank.

This is expected behavior in most non-C locales.

In all cases I'm using locale LATIN9 during DB creation, but I tested

also

with ASCII, UTF8 and LATIN1 encoding.

LATIN9 isn't a locale, it's an encoding. Try "initdb --locale=C".

regards, tom lane

--------------------
I guess this has nothing to do with the encoding, but with the collation
rules used, which is governed by "lc_collate" parameter. See what you
get on both DBs for:

SHOW lc_collate ;

HTH,
Csaba.
-------------------

Thanks Tom, and Csaba

both of you hit the problem: actually Postgres 7.4.7 has a C locale and
Postgres 8.1 has US.UTF8 locale. Setting locale to locale=C or locale=POSIX
for release 8.1 solved this issue, but it opens another one: if I use
locale=C, I get

XXXX A
XXXX C
XXXXB

as sort order, but this setting gives me an error when it cames to:

XXXX d
XXXX e
XXXX f
XXXX è

because the right sort ordering should be:

XXXX d
XXXX e
XXXX è
XXXX f

So the problem is:

- C or POSIX locale is OK with blanks but fails on locale specific vowels
- LATIN9 locale is OK with vowels but ignores blanks

Is there any way to consider blanks meaningfull AND sort properly locale
specific vowels ?

I don't know what SQL standard says about this issue, but I'm sure that in
Italy you sort names considering vowels AND blanks!

Thanks, Luca

#4Noname
larzeni@amadego.com
In reply to: Luca Arzeni (#3)
Re: varchar sort ordering ignore blanks

On Jan 15, 2008 6:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Luca Arzeni <l.arzeni@amadego.com> writes:
That is: the sort order in postgres 8.1.9 seems to ignore the blank.

--------------------
This is expected behavior in most non-C locales.
Try "initdb --locale=C".
regards, tom lane

--------------------
I guess this has nothing to do with the encoding, but with
the collation rules used, which is governed by "lc_collate"
parameter. See what you get on both DBs for:
SHOW lc_collate ;
HTH,
Csaba.

Thanks Tom, and Csaba

both of you hit the problem: actually Postgres 7.4.7 has a C locale and
Postgres 8.1 has US.UTF8 locale. Setting locale to locale=C or
locale=POSIX for release 8.1 solved this issue, but it opens another one:
if I use locale=C, I get:

XXXX A
XXXX C
XXXXB

as sort order (which is fine regarding blanks), but this setting gives me
an error when it cames to:

XXXX d
XXXX e
XXXX f
XXXX �

because the right sort ordering should be:

XXXX d
XXXX e
XXXX �
XXXX f

So the problem is:

- C or POSIX locale is OK with blanks but fails on locale specific vowels
- LATIN9 locale is OK with vowels but ignores blanks

Is there any way to consider blanks meaningfull AND sort properly locale
specific vowels ?

I don't know what SQL standard says about this issue, but I'm sure that in
Italy you sort names considering vowels AND blanks :-} !

Thanks, Luca

#5Luca Arzeni
l.arzeni@amadego.com
In reply to: Noname (#4)
Re: varchar sort ordering ignore blanks

On Jan 15, 2008 6:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Luca Arzeni <l.arzeni@amadego.com> writes:
That is: the sort order in postgres 8.1.9 seems to ignore the blank.

--------------------
This is expected behavior in most non-C locales.
Try "initdb --locale=C".
regards, tom lane

--------------------
I guess this has nothing to do with the encoding, but with
the collation rules used, which is governed by "lc_collate"
parameter. See what you get on both DBs for:
SHOW lc_collate ;
HTH,
Csaba.

Thanks Tom, and Csaba

both of you hit the problem: actually Postgres 7.4.7 has a C locale and
Postgres 8.1 has US.UTF8 locale. Setting locale to locale=C or
locale=POSIX for release 8.1 solved this issue, but it opens another one:
if I use locale=C, I get:

XXXX A
XXXX C
XXXXB

as sort order (which is fine regarding blanks), but this setting gives me
an error when it cames to:

XXXX d
XXXX e
XXXX f
XXXX �

because the right sort ordering should be:

XXXX d
XXXX e
XXXX �
XXXX f

So the problem is:

- C or POSIX locale is OK with blanks but fails on locale specific vowels
- LATIN9 locale is OK with vowels but ignores blanks

Is there any way to consider blanks meaningfull AND sort properly locale
specific vowels ?

I don't know what SQL standard says about this issue, but I'm sure that in
Italy you sort names considering vowels AND blanks :-} !

Thanks, Luca

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Luca Arzeni (#5)
Re: varchar sort ordering ignore blanks

"Luca Arzeni" <l.arzeni@amadego.com> writes:

Is there any way to consider blanks meaningfull AND sort properly locale
specific vowels ?

This isn't a Postgres question, it's a locale question. (If you try,
you'll find that sort(1) sorts the same as we do in any given locale.)

I imagine you could create a custom locale definition that acts this
way, but I have no idea about the degree of wizardry involved.
"man localedef" would probably be a place to start.

If you come up with a reasonably simple recipe for this, please post
it here, as you're not the first to have asked and you likely won't
be the last ...

regards, tom lane

#7Pavel Stehule
pavel.stehule@gmail.com
In reply to: Luca Arzeni (#5)
Re: varchar sort ordering ignore blanks

Hello,

you have to use correct localses for your encoding and country:

for czech and utf8 is

cs_CZ.UTF8 ..

for latin2 is

cs_CZ.latin2 etc

czech sorting has more exception and it works

caa
čaa
daa
cha ... it is well for czech
iaa

Show quoted text

On 20/01/2008, Luca Arzeni <l.arzeni@amadego.com> wrote:

On Jan 15, 2008 6:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Luca Arzeni <l.arzeni@amadego.com> writes:
That is: the sort order in postgres 8.1.9 seems to ignore the blank.

--------------------
This is expected behavior in most non-C locales.
Try "initdb --locale=C".
regards, tom lane

--------------------
I guess this has nothing to do with the encoding, but with
the collation rules used, which is governed by "lc_collate"
parameter. See what you get on both DBs for:
SHOW lc_collate ;
HTH,
Csaba.

Thanks Tom, and Csaba

both of you hit the problem: actually Postgres 7.4.7 has a C locale and
Postgres 8.1 has US.UTF8 locale. Setting locale to locale=C or
locale=POSIX for release 8.1 solved this issue, but it opens another one:
if I use locale=C, I get:

XXXX A
XXXX C
XXXXB

as sort order (which is fine regarding blanks), but this setting gives me
an error when it cames to:

XXXX d
XXXX e
XXXX f
XXXX è

because the right sort ordering should be:

XXXX d
XXXX e
XXXX è
XXXX f

So the problem is:

- C or POSIX locale is OK with blanks but fails on locale specific vowels
- LATIN9 locale is OK with vowels but ignores blanks

Is there any way to consider blanks meaningfull AND sort properly locale
specific vowels ?

I don't know what SQL standard says about this issue, but I'm sure that in
Italy you sort names considering vowels AND blanks :-} !

Thanks, Luca

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org/

#8Luca Arzeni
l.arzeni@amadego.com
In reply to: Tom Lane (#6)
Re: varchar sort ordering ignore blanks

On Sunday 20 January 2008 01:07, Tom Lane wrote:

"Luca Arzeni" <l.arzeni@amadego.com> writes:

Is there any way to consider blanks meaningfull AND sort properly locale
specific vowels ?

This isn't a Postgres question, it's a locale question. (If you try,
you'll find that sort(1) sorts the same as we do in any given locale.)

I imagine you could create a custom locale definition that acts this
way, but I have no idea about the degree of wizardry involved.
"man localedef" would probably be a place to start.

If you come up with a reasonably simple recipe for this, please post
it here, as you're not the first to have asked and you likely won't
be the last ...

regards, tom lane

Thanks Tom,
I gave a look at localedef, but it seems too much complex for my
understanding.

I understood that (under debian etch) lc_collate is defined for posix in file:
/usr/share/i18n/locales/POSIX

Here actually I can find an undestandable LC_COLLATE directive that defines
all chars one after the other in ASCII order. Thats fine.

Then I looked at it_IT locale definition and noticed that this locale (and
many if not all other locales, as far as I can see) have a collation order
inherited from the file:
/usr/share/i18n/locales/iso14651_t1

This seems to be a iso standard, but it is not easily understandable by me. It
includes macros and defines also sorting for many character sets, including
arabic and grec.

I think I've found the problem (space is actually ignored and put in a
separate list from other chars), but I'm not able to understand what shoud be
the proper value to put in the row.

At line 537 I can read the following lines:
order_start <SPECIAL>;forward;backward;forward;forward,position
#
# Tout caractère non précisément défini sera considéré comme caractère spécial
# et considéré uniquement au dernier niveau.
#
# Any character not precisely specified will be considered as a special
# character and considered only at the last level.
# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
#
# SYMB. N° GLY
#
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

so I guess space is beeing ignored.

At line 810 I can read:
<U2079> <9>;<BAS>;<EMI>;IGNORE # 197 <9S>
#
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
<U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
<U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
<U00E0> <a>;<GRA>;<MIN>;IGNORE # 201 à
<U00E2> <a>;<CIR>;<MIN>;IGNORE # 202 â
<U00E3> <a>;<TIL>;<MIN>;IGNORE # 203 ã
<U00E4> <a>;<REU>;<MIN>;IGNORE # 204 ä
<U00E5> <a>;<RNE>;<MIN>;IGNORE # 205 å

So my guess is that if I could put properly the space AFTER the commented line
and before the linte that starts with <U0061> (that is, before the lower "a")
I could solve the problem.

May be that some locale wizard is listening around and can help me?

Thanks, Luca

--
Ing. Luca Arzeni
Amadego S.R.L.
tel.: 02 6193672
cell.: 339 8350298
mailto: l.arzeni@amadego.com

=== Start-of Internet E-mail Confidentiality Footer ===

L'uso non autorizzato di questo messaggio o dei suoi allegati e' vietato e
potrebbe costituire reato.
Se ha ricevuto per errore questo messaggio, La preghiamo di informarci e di
distruggerlo immediatamente coi suoi allegati.
Le dichiarazioni contenute in questo messaggio o nei suoi allegati non
impegnano Amadego S.R.L. nei confronti del destinatario o di terzi.
Amadego S.R.L. non si assume alcuna responsabilita' per eventuali
intercettazioni, modifiche o danneggiamenti del presente messaggio.

Any unauthorized use of this e-mail or any of its attachments is prohibited
and could constitute an offence.
If you are not the intended addressee please advise immediately the sender and
destroy the message and its attachments.
The contents of this message shall be understood as neither given nor endorsed
by Amadego S.R.L.
Amadego S.R.L. does not accept liability for corruption, interception or
amendment, if any, or the consequences thereof.