lc_collate issue

Started by Cody Pistoover 18 years ago10 messagesgeneral
Jump to latest
#1Cody Pisto
cpisto@rvweb.com

Hi All,

I'm looking for any kind of a reason (and potential workarounds), be it
bug or otherwise, why the following two queries produce different
results under a database encoding of UTF8 and lc_collate of en_US.UTF-8:

SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING
ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x)

x
------------------
Somethang
-SOMETHING ELSE-
Something else
(3 rows)

*AND*

SELECT x FROM (SELECT 'Somethingelse' AS x UNION SELECT
'-SOMETHINGELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x)

x
-----------------
Somethang
Somethingelse
-SOMETHINGELSE-

The removal of spaces from the strings gives "more correct" sorting
results, with the spaces and '-' characters, '-SOMETHING ELSE-' is
strangely sorted in the middle?
It does not matter if you use LOWER or UPPER, and the "problem" does not
occur on databases with encoding SQL_ASCII and lc_collate of C

I have tested this on Postgres 8.1.9, 8.2, 8.2.4 with database encoding
of UTF8 and lc_collate of en_US.UTF8
and on 7.4.16 with database encoding of SQL_ASCII and lc_collate of C

Thank in advance for any consideration!

-Cody

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Cody Pisto (#1)
Re: lc_collate issue

Cody Pisto <cpisto@rvweb.com> writes:

I'm looking for any kind of a reason (and potential workarounds), be it
bug or otherwise, why the following two queries produce different
results under a database encoding of UTF8 and lc_collate of en_US.UTF-8:

That's just how it is in most non-C locales --- they use some weird
algorithm that's alleged to approximate what dictionary makers
traditionally do with phrases. I don't recall the details but there's
something about multiple passes with spaces being ignored in earlier
passes. You'll find that sort(1) sorts these lines the same way.

If you don't like it, use C locale, or put together your own locale
definition. (No, I'm not sure how hard that is ...)

regards, tom lane

#3Cody Pisto
cpisto@rvweb.com
In reply to: Tom Lane (#2)
Re: lc_collate issue

Hi Tom,

Thanks for answering,

I pretty much assumed that was the case (whatever library postgres is
using for encoding is causing the "issue")

The[my] problem is, it just seems like completely incorrect behavior..

The quickest and dirtiest examples I can do are that both python and
mysql sort these 3 example strings (in utf8 encoding) the way I would
expect (like a C locale)

python:

x = [unicode("Somethang", "utf8"), unicode("-SOMETHING ELSE-",

"utf8"), unicode("Something else", "utf8")]

x

[u'Somethang', u'-SOMETHING ELSE-', u'Something else']

x.sort()
x

[u'-SOMETHING ELSE-', u'Somethang', u'Something else']

mysql:

mysql> SET NAMES 'utf8';
Query OK, 0 rows affected (0.02 sec)

mysql> SELECT x FROM (SELECT 'Something else' AS x UNION SELECT
'-SOMETHING ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
+------------------+
| x |
+------------------+
| -SOMETHING ELSE- |
| Somethang |
| Something else |
+------------------+
3 rows in set (0.00 sec)

postgres:

SELECT x FROM (SELECT 'Something else' AS x UNION SELECT '-SOMETHING
ELSE-' AS x UNION SELECT 'Somethang' AS x) y ORDER BY LOWER(x);
x
------------------
Somethang
-SOMETHING ELSE-
Something else
(3 rows)

And I bet oracle, firebird, sqlite, mssql, and everything else out there
that does utf8 would return it in the "right" order (I'm willing to test
that too if needed..)

If this is potentially a problem in postgres somewhere, point me in the
general direction and I'm more than willing to fix it myself..

Thanks for your consideration..

-Cody

Tom Lane wrote:

Cody Pisto <cpisto@rvweb.com> writes:

I'm looking for any kind of a reason (and potential workarounds), be it
bug or otherwise, why the following two queries produce different
results under a database encoding of UTF8 and lc_collate of en_US.UTF-8:

That's just how it is in most non-C locales --- they use some weird
algorithm that's alleged to approximate what dictionary makers
traditionally do with phrases. I don't recall the details but there's
something about multiple passes with spaces being ignored in earlier
passes. You'll find that sort(1) sorts these lines the same way.

If you don't like it, use C locale, or put together your own locale
definition. (No, I'm not sure how hard that is ...)

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

--

*Cody Pisto*
Redzia RVs
10555 Montgomery NE
Suite 80
Albuquerque, NM 87111
Phone: (866) 844-1986
Fax: (866) 204-4403

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Cody Pisto (#3)
Re: lc_collate issue

Cody Pisto <cpisto@rvweb.com> writes:

If this is potentially a problem in postgres somewhere, point me in the
general direction and I'm more than willing to fix it myself..

You seem not to have absorbed what I said. This *is* the correct result
according to that locale's definition of sorting. You can demonstrate
that without any use of Postgres:

[tgl@rh2 ~]$ cat fooey
Somethang
-SOMETHING ELSE-
Something else
[tgl@rh2 ~]$ LANG=C sort fooey
-SOMETHING ELSE-
Somethang
Something else
[tgl@rh2 ~]$ LANG=en_US sort fooey
Somethang
Something else
-SOMETHING ELSE-
[tgl@rh2 ~]$

If you prefer C sort ordering, run Postgres in C locale. It's as
simple as that.

regards, tom lane

#5Cody Pisto
cpisto@rvweb.com
In reply to: Tom Lane (#4)
Re: lc_collate issue

Hi Tom,

I did understand what you said, I apologize that it came out otherwise.

I'm just looking for the correct workaround.

If initdb was done with a C locale, and thus lc_collate and friends
where all C, but the database and client encoding was set to UTF-8,
would postgres convert data on the fly from UTF-8(storage) to ASCII for
sorting or would things just blow up when a >1 byte character hit the mix?

The docs say bad things would happen:

http://www.postgresql.org/docs/8.2/static/multibyte.html
*Important: * Although you can specify any encoding you want for a
database, it is unwise to choose an encoding that is not what is
expected by the locale you have selected. The LC_COLLATE and LC_CTYPE
settings imply a particular encoding, and locale-dependent operations
(such as sorting) are likely to misinterpret data that is in an
incompatible encoding.

Right now for me ORDER BY LOWER(ASCII(column)), LOWER(column) (or some
variation there of) works, but is there a better workaround?

Thanks,

-Cody

Tom Lane wrote:

Cody Pisto <cpisto@rvweb.com> writes:

If this is potentially a problem in postgres somewhere, point me in the
general direction and I'm more than willing to fix it myself..

You seem not to have absorbed what I said. This *is* the correct result
according to that locale's definition of sorting. You can demonstrate
that without any use of Postgres:

[tgl@rh2 ~]$ cat fooey
Somethang
-SOMETHING ELSE-
Something else
[tgl@rh2 ~]$ LANG=C sort fooey
-SOMETHING ELSE-
Somethang
Something else
[tgl@rh2 ~]$ LANG=en_US sort fooey
Somethang
Something else
-SOMETHING ELSE-
[tgl@rh2 ~]$

If you prefer C sort ordering, run Postgres in C locale. It's as
simple as that.

regards, tom lane

--

*Cody Pisto*
Redzia RVs
10555 Montgomery NE
Suite 80
Albuquerque, NM 87111
Phone: (866) 844-1986
Fax: (866) 204-4403

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Cody Pisto (#5)
Re: lc_collate issue

Cody Pisto <cpisto@rvweb.com> writes:

If initdb was done with a C locale, and thus lc_collate and friends
where all C, but the database and client encoding was set to UTF-8,
would postgres convert data on the fly from UTF-8(storage) to ASCII for
sorting or would things just blow up when a >1 byte character hit the mix?

No, C locale just sorts the bytes. It won't "blow up". Whether it will
give you a sort ordering you like for multibyte characters is a
different question.

regards, tom lane

#7Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#6)
Re: lc_collate issue

Cody Pisto <cpisto@rvweb.com> writes:

If initdb was done with a C locale, and thus lc_collate and friends
where all C, but the database and client encoding was set to UTF-8,
would postgres convert data on the fly from UTF-8(storage) to ASCII for
sorting or would things just blow up when a >1 byte character hit the mix?

No, C locale just sorts the bytes. It won't "blow up". Whether it will
give you a sort ordering you like for multibyte characters is a
different question.

Yup.

For example, LATIN1 part of UTF-8 (UNICODE) is physicaly ordered same
as ISO 8859-1. So if you see the order of ISO 8859-1 is "natural",
then the sort order of UTF-8 is ok as well. However the order of CJK
part of UTF-8 is totally different from the original charcater sets
(almost random), you need to use convert() for converting UTF-8 to
original encoding to get "natural" sort order. I don't think you are
interested in CJK part, though.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Cody Pisto (#5)
Re: lc_collate issue

Cody Pisto wrote:

I'm just looking for the correct workaround.

The canonically correct workaround it to define your own locale.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#9Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Cody Pisto (#5)
Re: lc_collate issue

would postgres convert data on the fly from UTF-8(storage) to ASCII for
sorting

That ain't possible, it seems, or else we wouldn't need UTF-8.

Karsten
--
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail

#10Dennis Bjorklund
db@zigo.dhs.org
In reply to: Cody Pisto (#5)
Re: lc_collate issue

I'm just looking for the correct workaround.

While adding a new collation is the "correct" solution it's a lot of work.
Even then pg can't use different collations anyway unless you reinit
the datadir using initdb.

One workaround is to cast the text value into a bytea value, and then it
will be sorted in byte order no matter what locale you have. Like this:

SELECT *
FROM foo
ORDER BY CAST (some_column AS BYTEA);

This work except that there is no cast from text to bytea. But we can add
one using:

CREATE CAST (text AS bytea) WITHOUT FUNCTION AS ASSIGNMENT;

I can't promise that WITHOUT FUNCTION will always work but currently bytea
and text values are stored in the same way so it should work (and it
probably will in future versions as well).

/Dennis