Sorting Problem

Started by Tim Edwardsover 22 years ago41 messagesgeneral
Jump to latest
#1Tim Edwards
mor4321@hotmail.com

Currently I'm running PostgreSQL 7.2.3 and having a problem sorting.

I've got two colums of data, one Int4 one Varchar.

When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-. I've checked for spaces or tabs or anything else odd and
cant find anything that would effect the sort.

63 RM-791
38 RM-AV2100
9 RM-PP401
10 RM-PP402
8 RM-PP404
3 RMT-D10
4 RMT-D108A
5 RMT-D109A
6 RMT-D116A
34 RMT-V402
40 RM-V10
41 RM-V11
45 RM-V12

Here's the sort string I'm using:

SELECT "remote_id", "remote_model" FROM "remote" ORDER BY "remote_model"
ASC;

Any idea or suggestions why it would do this ?

_________________________________________________________________
MSN 8 with e-mail virus protection service: 2 months FREE*
http://join.msn.com/?page=features/virus

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tim Edwards (#1)
Re: Sorting Problem

"Tim Edwards" <mor4321@hotmail.com> writes:

When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

regards, tom lane

#3Kathy Zhu
Kathy.Zhu@Sun.COM
In reply to: Tim Edwards (#1)
Re: Sorting Problem

Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??

thanks,
kathy

Tom Lane wrote:

"Tim Edwards" <mor4321@hotmail.com> writes:

When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--

#4Maksim Likharev
mlikharev@aurigin.com
In reply to: Kathy Zhu (#3)
Re: Sorting Problem

en_US locale skips? punctuation from sorting index,
so in your case

RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402

==>

RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??

thanks,
kathy

Tom Lane wrote:

"Tim Edwards" <mor4321@hotmail.com> writes:

When I sort ASC on the varchar I get some strange results. Here a

section of

data cut after running a sort. It starts with RM- then does RMT- Then

goes

back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to

majordomo@postgresql.org)

--

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if
your
joining column's datatypes do not match

#5Dennis Gearon
gearond@cvc.net
In reply to: Tom Lane (#2)
Re: Sorting Problem

To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running with a different LOCALE than the machine's?

Tom Lane wrote:

Show quoted text

"Tim Edwards" <mor4321@hotmail.com> writes:

When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dennis Gearon (#5)
Re: Sorting Problem

Dennis Gearon <gearond@cvc.net> writes:

To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running with a different LOCALE than the machine's?

Yeah, he needs to set LANG and/or LC_ALL to be 'C' instead of whatever
it's defaulting to on his machine. Then run initdb with that
environment, and he's set.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kathy Zhu (#3)
Re: Sorting Problem

Kathy zhu <Kathy.zhu@Sun.COM> writes:

Do you mean that soring doesn't work for en_US locale ???

Oh it works all right, it just doesn't agree with Tim's idea of what
sorted order is ;-)

regards, tom lane

#8Kathy Zhu
Kathy.Zhu@Sun.COM
In reply to: Maksim Likharev (#4)
Re: Sorting Problem

If it skips "-", then RMT-* should come before RM-V*, but they don't, why ??

Maksim Likharev wrote:

en_US locale skips? punctuation from sorting index,
so in your case

RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402

==>

RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??

thanks,
kathy

Tom Lane wrote:

"Tim Edwards" <mor4321@hotmail.com> writes:

When I sort ASC on the varchar I get some strange results. Here a

section of

data cut after running a sort. It starts with RM- then does RMT- Then

goes

back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to

majordomo@postgresql.org)

--

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tim Edwards (#1)
Re: Sorting Problem

Dennis Gearon <gearond@cvc.net> writes:

You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for Posgres specifically?

I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.

regards, tom lane

#10Dennis Gearon
gearond@cvc.net
In reply to: Tom Lane (#6)
Re: Sorting Problem

You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for Posgres specifically?

Tom Lane wrote:

Show quoted text

Dennis Gearon <gearond@cvc.net> writes:

To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running with a different LOCALE than the machine's?

Yeah, he needs to set LANG and/or LC_ALL to be 'C' instead of whatever
it's defaulting to on his machine. Then run initdb with that
environment, and he's set.

regards, tom lane

#11Dennis Gearon
gearond@cvc.net
In reply to: Tom Lane (#9)
Re: Sorting Problem

Danke, Spacibo, gracias, thanks.

Tom Lane wrote:

Show quoted text

Dennis Gearon <gearond@cvc.net> writes:

You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for Posgres specifically?

I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.

regards, tom lane

#12Jeffrey Melloy
jmelloy@visualdistortion.org
In reply to: Kathy Zhu (#8)
Re: Sorting Problem

It does if you look at the original email. Maksim must've just
transposed a couple letters when he was writing his demo.

Jeff

Kathy zhu wrote:

Show quoted text

If it skips "-", then RMT-* should come before RM-V*, but they don't,
why ??

Maksim Likharev wrote:

en_US locale skips? punctuation from sorting index,
so in your case

RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402

==>

RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??

thanks,
kathy

Tom Lane wrote:

"Tim Edwards" <mor4321@hotmail.com> writes:

When I sort ASC on the varchar I get some strange results. Here a

section of

data cut after running a sort. It starts with RM- then does RMT- Then

goes

back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to

majordomo@postgresql.org)

#13Maksim Likharev
mlikharev@aurigin.com
In reply to: Jeffrey Melloy (#12)
Re: Sorting Problem

and they come:

sorted asc:
RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402
RM-V10
RM-V11
RM-V12

sorted asc:
RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402
RMV10
RMV11
RMV12

I think I copied and pasted wrogn, sorry.

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:08 PM
To: Maksim Likharev
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

If it skips "-", then RMT-* should come before RM-V*, but they don't,
why ??

Maksim Likharev wrote:

en_US locale skips? punctuation from sorting index,
so in your case

RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402

==>

RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??

thanks,
kathy

Tom Lane wrote:

"Tim Edwards" <mor4321@hotmail.com> writes:

When I sort ASC on the varchar I get some strange results. Here a

section of

data cut after running a sort. It starts with RM- then does RMT- Then

goes

back for more RM-.

Sounds like you're in en_US locale, or at least something other than C
locale.

Unfortunately this can only be fixed by re-initdb'ing :-(

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to

majordomo@postgresql.org)

--

#14Kathy Zhu
Kathy.Zhu@Sun.COM
In reply to: Tim Edwards (#1)
Re: Sorting Problem

This brings up another question:

Say initdb with en_US locale, and we have localized strings for different
languages store in the db.

If we have a client in Germany, and want to see the text sorted in german. I
mean that we want do db soring for german strings and display the result in the
browser. How are we going to handle that ??

thanks,
kathy

Dennis Gearon wrote:

Danke, Spacibo, gracias, thanks.

Tom Lane wrote:

Dennis Gearon <gearond@cvc.net> writes:

You mean in his own local environment? So all his programs, console
operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
Posgres specifically?

I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

--

#15Maksim Likharev
mlikharev@aurigin.com
In reply to: Kathy Zhu (#14)
Re: Sorting Problem

If you are talking about everything that lies under so called LATIN-1 (
ISO-8859-1 )
en_US encapsulates ( at least suppose to ) all those sorting rules,
do not remember about accents tho.

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:43 PM
To: gearond@cvc.net
Cc: Tom Lane; Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

This brings up another question:

Say initdb with en_US locale, and we have localized strings for
different
languages store in the db.

If we have a client in Germany, and want to see the text sorted in
german. I
mean that we want do db soring for german strings and display the result
in the
browser. How are we going to handle that ??

thanks,
kathy

Dennis Gearon wrote:

Danke, Spacibo, gracias, thanks.

Tom Lane wrote:

Dennis Gearon <gearond@cvc.net> writes:

You mean in his own local environment? So all his programs, console
operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
Posgres specifically?

I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 7: don't forget to increase your free space map settings

--

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

#16Kathy Zhu
Kathy.Zhu@Sun.COM
In reply to: Maksim Likharev (#15)
Re: Sorting Problem

Does including sorting rules for asian chars too ??
Do you know where I can find docuemt on this subject ??
thanks,
kathy

Maksim Likharev wrote:

Show quoted text

If you are talking about everything that lies under so called LATIN-1 (
ISO-8859-1 )
en_US encapsulates ( at least suppose to ) all those sorting rules,
do not remember about accents tho.

-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:43 PM
To: gearond@cvc.net
Cc: Tom Lane; Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem

This brings up another question:

Say initdb with en_US locale, and we have localized strings for
different
languages store in the db.

If we have a client in Germany, and want to see the text sorted in
german. I
mean that we want do db soring for german strings and display the result
in the
browser. How are we going to handle that ??

thanks,
kathy

Dennis Gearon wrote:

Danke, Spacibo, gracias, thanks.

Tom Lane wrote:

Dennis Gearon <gearond@cvc.net> writes:

You mean in his own local environment? So all his programs, console
operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
Posgres specifically?

I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.

regards, tom lane

---------------------------(end of

broadcast)---------------------------

TIP 7: don't forget to increase your free space map settings

#17Dennis Bjorklund
db@zigo.dhs.org
In reply to: Maksim Likharev (#15)
Re: Sorting Problem

On Tue, 12 Aug 2003, Maksim Likharev wrote:

If you are talking about everything that lies under so called LATIN-1 (
ISO-8859-1 ) en_US encapsulates ( at least suppose to ) all those
sorting rules, do not remember about accents tho.

It does not work like that. Different countries in europe that all use
latin1 use different sort order for the same characters. There is no way
to have one set of rules work for all. If we take swedish for example,
then v and w is treatead as the same letter when sorting. So in a
dictionary all words beginning with v or w is mixed together.

In the future we need indexes that depend on the locale (and a lot of
other changes).

--
/Dennis

#18Tim Edwards
mor4321@hotmail.com
In reply to: Dennis Bjorklund (#17)
Re: Sorting Problem

I was afraid of something like that. I found a sort term fix by using the
following:

select remote_model,
translate(remote_model, '-', 'a')
from remote
order by translate(remote_model, '-', 'a') asc

But it's an ugly answer to this, but it works for now. Guess I'm going to be
readin up on Locales

_________________________________________________________________
Help STOP SPAM with the new MSN 8 and get 2 months FREE*
http://join.msn.com/?page=features/junkmail

#19Dennis Gearon
gearond@cvc.net
In reply to: Dennis Bjorklund (#17)
Re: Sorting Problem

Dennis Bj�rklund wrote:

In the future we need indexes that depend on the locale (and a lot of other changes).

I agree. I've been looking at the web on this subject a lot lately. I am **NOT** a microslop fan, but SQL-SERVER even lets a user define a language(maybe encoding) down to the column level!

I've been reading on GNU-C and on languages, encoding, and localization.

http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html

There are three basic approaches to doing different langauges in computerized text:

A/ various adaptations of the 8 bit character set, I.E. the ISO-8859-x series.
One byte per character.
Easy storing, small size for a string.
Easy storing, if english characters, 100% efficient use of storage space.
Easy processing between applications, works well in the stream model of *nix
Easy processing in applications, a byte is a character.
Easy string handling, NOY NULL bytes in a string, except end of string.
NOT easy to know encoding from inherently in the document.
This is not the way of the future.

B/ wide characters
UTF16, UTF32, SHIFT-JIS-16, others
each character the same width, 2 or 4 bytes (2 bytes handles 99% of all languages)
Not so easy storing, if english characters, 50% to 75% loss of storage space.
Difficult processing between applications, does NOT work well in the stream model of *nix
Easy processing in applications, a set width of bits/bytes is a character.
Difficult string handling, MANY NULL bytes in a string, especially if in English.
Moderately easy to tell encoding/language in the document.
********This should be how Postgress stores data internally.********

C/ Multibyte characters
UTF8
variable width for different characters 1-5
Not so easy storing, if non english characters, 50% to 80% loss of storage space,
(in reality, most common western languages hover aournd 5-20% loss of storage space
most common non western languages hover aournd 40-60%% loss of storage space)
Easy processing between applications, works well in the stream model of *nix
Difficult processing in applications, a variable number of bytes is a character.
Easy string string handling, ONE NULL byte in a string.
Moderately easy to tell encoding/language in the document.
********This is how Postgress should default to sending data OUT of the application,
i.e. to the display or the web, or other system applications********

Show quoted text
#20Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Dennis Gearon (#19)
Re: Sorting Problem

On Wed, 13 Aug 2003, Dennis Gearon wrote:

Dennis Bj�rklund wrote:

In the future we need indexes that depend on the locale (and a lot of other changes).

I agree. I've been looking at the web on this subject a lot lately. I
am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
language(maybe encoding) down to the column level!

I've been reading on GNU-C and on languages, encoding, and localization.

http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html

There are three basic approaches to doing different langauges in computerized text:

A/ various adaptations of the 8 bit character set, I.E. the ISO-8859-x series.
B/ wide characters
********This should be how Postgress stores data internally.********
C/ Multibyte characters
********This is how Postgress should default to sending data OUT of the application,
i.e. to the display or the web, or other system applications********

SQL has a system for defining character set specifications, collations and
such (per column/literal in some cases). We should probably look at it
before making decisions on how to do things.

#21Dennis Gearon
gearond@cvc.net
In reply to: Stephan Szabo (#20)
#22Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Dennis Gearon (#21)
#23Maksim Likharev
mlikharev@aurigin.com
In reply to: Stephan Szabo (#22)
#24Gianni Mariani
gianni@mariani.ws
In reply to: Dennis Gearon (#21)
#25Dennis Gearon
gearond@cvc.net
In reply to: Stephan Szabo (#22)
#26Dennis Gearon
gearond@cvc.net
In reply to: Maksim Likharev (#23)
#27Dennis Gearon
gearond@cvc.net
In reply to: Gianni Mariani (#24)
#28Bruce Momjian
bruce@momjian.us
In reply to: Dennis Gearon (#27)
#29Maksim Likharev
mlikharev@aurigin.com
In reply to: Bruce Momjian (#28)
#30Gianni Mariani
gianni@mariani.ws
In reply to: Dennis Gearon (#27)
#31Dennis Gearon
gearond@cvc.net
In reply to: Bruce Momjian (#28)
#32Dennis Gearon
gearond@cvc.net
In reply to: Gianni Mariani (#30)
#33Dennis Gearon
gearond@cvc.net
In reply to: Gianni Mariani (#30)
#34Gianni Mariani
gianni@mariani.ws
In reply to: Dennis Gearon (#32)
#35Kathy Zhu
Kathy.Zhu@Sun.COM
In reply to: Bruce Momjian (#28)
#36Dennis Gearon
gearond@cvc.net
In reply to: Kathy Zhu (#35)
#37Maksim Likharev
mlikharev@aurigin.com
In reply to: Dennis Gearon (#36)
#38Martijn van Oosterhout
kleptog@svana.org
In reply to: Dennis Gearon (#33)
#39Peter Eisentraut
peter_e@gmx.net
In reply to: Dennis Gearon (#25)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#39)
#41Dennis Gearon
gearond@cvc.net
In reply to: Peter Eisentraut (#39)