Sorting Problem
Currently I'm running PostgreSQL 7.2.3 and having a problem sorting.
I've got two colums of data, one Int4 one Varchar.
When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-. I've checked for spaces or tabs or anything else odd and
cant find anything that would effect the sort.
63 RM-791
38 RM-AV2100
9 RM-PP401
10 RM-PP402
8 RM-PP404
3 RMT-D10
4 RMT-D108A
5 RMT-D109A
6 RMT-D116A
34 RMT-V402
40 RM-V10
41 RM-V11
45 RM-V12
Here's the sort string I'm using:
SELECT "remote_id", "remote_model" FROM "remote" ORDER BY "remote_model"
ASC;
Any idea or suggestions why it would do this ?
_________________________________________________________________
MSN 8 with e-mail virus protection service: 2 months FREE*
http://join.msn.com/?page=features/virus
"Tim Edwards" <mor4321@hotmail.com> writes:
When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-.
Sounds like you're in en_US locale, or at least something other than C
locale.
Unfortunately this can only be fixed by re-initdb'ing :-(
regards, tom lane
Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??
thanks,
kathy
Tom Lane wrote:
"Tim Edwards" <mor4321@hotmail.com> writes:
When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-.Sounds like you're in en_US locale, or at least something other than C
locale.Unfortunately this can only be fixed by re-initdb'ing :-(
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
--
en_US locale skips? punctuation from sorting index,
so in your case
RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402
==>
RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402
-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem
Do you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??
thanks,
kathy
Tom Lane wrote:
"Tim Edwards" <mor4321@hotmail.com> writes:
When I sort ASC on the varchar I get some strange results. Here a
section of
data cut after running a sort. It starts with RM- then does RMT- Then
goes
back for more RM-.
Sounds like you're in en_US locale, or at least something other than C
locale.Unfortunately this can only be fixed by re-initdb'ing :-(
regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to
majordomo@postgresql.org)
--
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if
your
joining column's datatypes do not match
Import Notes
Resolved by subject fallback
To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running with a different LOCALE than the machine's?
Tom Lane wrote:
Show quoted text
"Tim Edwards" <mor4321@hotmail.com> writes:
When I sort ASC on the varchar I get some strange results. Here a section of
data cut after running a sort. It starts with RM- then does RMT- Then goes
back for more RM-.Sounds like you're in en_US locale, or at least something other than C
locale.Unfortunately this can only be fixed by re-initdb'ing :-(
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
Dennis Gearon <gearond@cvc.net> writes:
To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running with a different LOCALE than the machine's?
Yeah, he needs to set LANG and/or LC_ALL to be 'C' instead of whatever
it's defaulting to on his machine. Then run initdb with that
environment, and he's set.
regards, tom lane
Kathy zhu <Kathy.zhu@Sun.COM> writes:
Do you mean that soring doesn't work for en_US locale ???
Oh it works all right, it just doesn't agree with Tim's idea of what
sorted order is ;-)
regards, tom lane
If it skips "-", then RMT-* should come before RM-V*, but they don't, why ??
Maksim Likharev wrote:
en_US locale skips? punctuation from sorting index,
so in your caseRM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402==>
RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting ProblemDo you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??thanks,
kathyTom Lane wrote:
"Tim Edwards" <mor4321@hotmail.com> writes:
When I sort ASC on the varchar I get some strange results. Here a
section of
data cut after running a sort. It starts with RM- then does RMT- Then
goes
back for more RM-.
Sounds like you're in en_US locale, or at least something other than C
locale.Unfortunately this can only be fixed by re-initdb'ing :-(
regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" tomajordomo@postgresql.org)
--
Dennis Gearon <gearond@cvc.net> writes:
You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for Posgres specifically?
I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.
regards, tom lane
Import Notes
Reply to msg id not found: 3F397658.3000603@cvc.net
You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for Posgres specifically?
Tom Lane wrote:
Show quoted text
Dennis Gearon <gearond@cvc.net> writes:
To help my understanding of this type of thing, when he reinits the database, can he get the PG backend to be running with a different LOCALE than the machine's?
Yeah, he needs to set LANG and/or LC_ALL to be 'C' instead of whatever
it's defaulting to on his machine. Then run initdb with that
environment, and he's set.regards, tom lane
Danke, Spacibo, gracias, thanks.
Tom Lane wrote:
Show quoted text
Dennis Gearon <gearond@cvc.net> writes:
You mean in his own local environment? So all his programs, console operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for Posgres specifically?
I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.regards, tom lane
It does if you look at the original email. Maksim must've just
transposed a couple letters when he was writing his demo.
Jeff
Kathy zhu wrote:
Show quoted text
If it skips "-", then RMT-* should come before RM-V*, but they don't,
why ??Maksim Likharev wrote:
en_US locale skips? punctuation from sorting index,
so in your caseRM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402==>
RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting ProblemDo you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??thanks,
kathyTom Lane wrote:
"Tim Edwards" <mor4321@hotmail.com> writes:
When I sort ASC on the varchar I get some strange results. Here a
section of
data cut after running a sort. It starts with RM- then does RMT- Then
goes
back for more RM-.
Sounds like you're in en_US locale, or at least something other than C
locale.Unfortunately this can only be fixed by re-initdb'ing :-(
regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" tomajordomo@postgresql.org)
and they come:
sorted asc:
RM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402
RM-V10
RM-V11
RM-V12
sorted asc:
RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402
RMV10
RMV11
RMV12
I think I copied and pasted wrogn, sorry.
-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:08 PM
To: Maksim Likharev
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem
If it skips "-", then RMT-* should come before RM-V*, but they don't,
why ??
Maksim Likharev wrote:
en_US locale skips? punctuation from sorting index,
so in your caseRM-791
RM-AV2100
RM-PP401
RM-PP402
RM-PP404
RM-V10
RM-V11
RM-V12
RMT-D10
RMT-D108A
RMT-D109A
RMT-D116A
RMT-V402==>
RM791
RMAV2100
RMPP401
RMPP402
RMPP404
RMV10
RMV11
RMV12
RMTD10
RMTD108A
RMTD109A
RMTD116A
RMTV402-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 3:30 PM
To: Tom Lane
Cc: Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting ProblemDo you mean that soring doesn't work for en_US locale ???
And, does encoding affect sorting at all ??thanks,
kathyTom Lane wrote:
"Tim Edwards" <mor4321@hotmail.com> writes:
When I sort ASC on the varchar I get some strange results. Here a
section of
data cut after running a sort. It starts with RM- then does RMT- Then
goes
back for more RM-.
Sounds like you're in en_US locale, or at least something other than C
locale.Unfortunately this can only be fixed by re-initdb'ing :-(
regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" tomajordomo@postgresql.org)
--
Import Notes
Resolved by subject fallback
This brings up another question:
Say initdb with en_US locale, and we have localized strings for different
languages store in the db.
If we have a client in Germany, and want to see the text sorted in german. I
mean that we want do db soring for german strings and display the result in the
browser. How are we going to handle that ??
thanks,
kathy
Dennis Gearon wrote:
Danke, Spacibo, gracias, thanks.
Tom Lane wrote:
Dennis Gearon <gearond@cvc.net> writes:
You mean in his own local environment? So all his programs, console
operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
Posgres specifically?I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
--
If you are talking about everything that lies under so called LATIN-1 (
ISO-8859-1 )
en_US encapsulates ( at least suppose to ) all those sorting rules,
do not remember about accents tho.
-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:43 PM
To: gearond@cvc.net
Cc: Tom Lane; Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem
This brings up another question:
Say initdb with en_US locale, and we have localized strings for
different
languages store in the db.
If we have a client in Germany, and want to see the text sorted in
german. I
mean that we want do db soring for german strings and display the result
in the
browser. How are we going to handle that ??
thanks,
kathy
Dennis Gearon wrote:
Danke, Spacibo, gracias, thanks.
Tom Lane wrote:
Dennis Gearon <gearond@cvc.net> writes:
You mean in his own local environment? So all his programs, console
operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
Posgres specifically?I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
--
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
Import Notes
Resolved by subject fallback
Does including sorting rules for asian chars too ??
Do you know where I can find docuemt on this subject ??
thanks,
kathy
Maksim Likharev wrote:
Show quoted text
If you are talking about everything that lies under so called LATIN-1 (
ISO-8859-1 )
en_US encapsulates ( at least suppose to ) all those sorting rules,
do not remember about accents tho.-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Tuesday, August 12, 2003 4:43 PM
To: gearond@cvc.net
Cc: Tom Lane; Tim Edwards; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting ProblemThis brings up another question:
Say initdb with en_US locale, and we have localized strings for
different
languages store in the db.If we have a client in Germany, and want to see the text sorted in
german. I
mean that we want do db soring for german strings and display the result
in the
browser. How are we going to handle that ??thanks,
kathyDennis Gearon wrote:
Danke, Spacibo, gracias, thanks.
Tom Lane wrote:
Dennis Gearon <gearond@cvc.net> writes:
You mean in his own local environment? So all his programs, console
operations, etc, will have the new encoding? Or 'LANG/LC_ALL' for
Posgres specifically?I mean he needs to run initdb with C as the selected locale. It has
nothing to do with what environment his other programs run in.regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
On Tue, 12 Aug 2003, Maksim Likharev wrote:
If you are talking about everything that lies under so called LATIN-1 (
ISO-8859-1 ) en_US encapsulates ( at least suppose to ) all those
sorting rules, do not remember about accents tho.
It does not work like that. Different countries in europe that all use
latin1 use different sort order for the same characters. There is no way
to have one set of rules work for all. If we take swedish for example,
then v and w is treatead as the same letter when sorting. So in a
dictionary all words beginning with v or w is mixed together.
In the future we need indexes that depend on the locale (and a lot of
other changes).
--
/Dennis
I was afraid of something like that. I found a sort term fix by using the
following:
select remote_model,
translate(remote_model, '-', 'a')
from remote
order by translate(remote_model, '-', 'a') asc
But it's an ugly answer to this, but it works for now. Guess I'm going to be
readin up on Locales
_________________________________________________________________
Help STOP SPAM with the new MSN 8 and get 2 months FREE*
http://join.msn.com/?page=features/junkmail
Import Notes
Resolved by subject fallback
Dennis Bj�rklund wrote:
In the future we need indexes that depend on the locale (and a lot of other changes).
I agree. I've been looking at the web on this subject a lot lately. I am **NOT** a microslop fan, but SQL-SERVER even lets a user define a language(maybe encoding) down to the column level!
I've been reading on GNU-C and on languages, encoding, and localization.
http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
There are three basic approaches to doing different langauges in computerized text:
A/ various adaptations of the 8 bit character set, I.E. the ISO-8859-x series.
One byte per character.
Easy storing, small size for a string.
Easy storing, if english characters, 100% efficient use of storage space.
Easy processing between applications, works well in the stream model of *nix
Easy processing in applications, a byte is a character.
Easy string handling, NOY NULL bytes in a string, except end of string.
NOT easy to know encoding from inherently in the document.
This is not the way of the future.
B/ wide characters
UTF16, UTF32, SHIFT-JIS-16, others
each character the same width, 2 or 4 bytes (2 bytes handles 99% of all languages)
Not so easy storing, if english characters, 50% to 75% loss of storage space.
Difficult processing between applications, does NOT work well in the stream model of *nix
Easy processing in applications, a set width of bits/bytes is a character.
Difficult string handling, MANY NULL bytes in a string, especially if in English.
Moderately easy to tell encoding/language in the document.
********This should be how Postgress stores data internally.********
C/ Multibyte characters
UTF8
variable width for different characters 1-5
Not so easy storing, if non english characters, 50% to 80% loss of storage space,
(in reality, most common western languages hover aournd 5-20% loss of storage space
most common non western languages hover aournd 40-60%% loss of storage space)
Easy processing between applications, works well in the stream model of *nix
Difficult processing in applications, a variable number of bytes is a character.
Easy string string handling, ONE NULL byte in a string.
Moderately easy to tell encoding/language in the document.
********This is how Postgress should default to sending data OUT of the application,
i.e. to the display or the web, or other system applications********
Show quoted text
On Wed, 13 Aug 2003, Dennis Gearon wrote:
Dennis Bj�rklund wrote:
In the future we need indexes that depend on the locale (and a lot of other changes).
I agree. I've been looking at the web on this subject a lot lately. I
am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
language(maybe encoding) down to the column level!I've been reading on GNU-C and on languages, encoding, and localization.
http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.htmlThere are three basic approaches to doing different langauges in computerized text:
A/ various adaptations of the 8 bit character set, I.E. the ISO-8859-x series.
B/ wide characters
********This should be how Postgress stores data internally.********
C/ Multibyte characters
********This is how Postgress should default to sending data OUT of the application,
i.e. to the display or the web, or other system applications********
SQL has a system for defining character set specifications, collations and
such (per column/literal in some cases). We should probably look at it
before making decisions on how to do things.