sort order (ORDER BY) hu_HU.UTF-8 locale with UTF-8 encoding is not working correctly on FreeBSD
Dear List,
I'm trying to solve this problem for more than 2 days, but I have no
other idea.
My system is:
FreeBSD titanium 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
UTC 2010 root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Tested PgSQL versions are: 8.4 and 9.0.3 (fresh install using ports)
initdb flags are: --encoding=UTF-8 --locale=hu_HU.UTF-8
Initdb output:
The files belonging to this database system will be owned by user "pgsql".
This user must also own the server process.
The database cluster will be initialized with locale hu_HU.UTF-8.
The default text search configuration will be set to "hungarian".
creating directory /usr/local/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 40
selecting default shared_buffers ... 28MB
creating configuration files ... ok
creating template1 database in /usr/local/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
step 1: createdb -U wooh test --encoding=UTF-8 --locale=hu_HU.UTF-8
step 2: psql -U wooh -l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access
privileges
-----------+-------+----------+-------------+-------------+-------------------
postgres | pgsql | UTF8 | hu_HU.UTF-8 | hu_HU.UTF-8 |
template0 | pgsql | UTF8 | hu_HU.UTF-8 | hu_HU.UTF-8 | =c/pgsql
+
| | | | | pgsql=CTc/pgsql
template1 | pgsql | UTF8 | hu_HU.UTF-8 | hu_HU.UTF-8 | =c/pgsql
+
| | | | | pgsql=CTc/pgsql
test | wooh | UTF8 | hu_HU.UTF-8 | hu_HU.UTF-8 |
Everything seems to be OK.
let's create a table with idn, and name.
step 3:
test=# create table test (idn serial primary key not null, name
varchar(255));
NOTICE: CREATE TABLE will create implicit sequence "test_idn_seq" for
serial column "test.idn"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"test_pkey" for table "test"
CREATE TABLE
step 4:
[root@titanium /home/wooh]# file insert.sql
insert.sql: UTF-8 Unicode text
step 5:
[root@titanium /home/wooh]# cat insert.sql
INSERT INTO test (name) values ( 'a' );
INSERT INTO test (name) values ( 'ďż˝' );
INSERT INTO test (name) values ( 'b' );
INSERT INTO test (name) values ( 'c' );
INSERT INTO test (name) values ( 'd' );
INSERT INTO test (name) values ( 'e' );
INSERT INTO test (name) values ( 'ďż˝' );
INSERT INTO test (name) values ( 'f' );
INSERT INTO test (name) values ( 'g' );
INSERT INTO test (name) values ( 'h' );
step 6:
[root@titanium /home/wooh]# psql -U wooh test < insert.sql
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
step 7:
[root@titanium /home/wooh]# psql -U wooh test
psql (9.0.3)
Type "help" for help.
test=# select * from test order by name asc;
idn | name
-----+------
1 | a
3 | b
4 | c
5 | d
6 | e
8 | f
9 | g
10 | h
2 | ďż˝
7 | ďż˝
(10 rows)
test=# show lc_collate;
lc_collate
-------------
hu_HU.UTF-8
(1 row)
test=# show lc_ctype;
lc_ctype
-------------
hu_HU.UTF-8
(1 row)
The sort order is incorrect, and I cannot find out why. Everything seems
to be Ok. (locale, collate, encoding)
But this whole thing works if I chose "LATIN2" encoding with
"hu_HU.ISO8859-2" locale. Unfortunately we use UTF-8 databases, so I
need to use the UTF8 Encoding with hu_HU.UTF-8
Any ideas?
Additional information:
[root@titanium /home/wooh]# pg_controldata /usr/local/pgsql/data/
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5570517093231621070
Database cluster state: in production
pg_control last modified: Sun Feb 6 11:19:27 2011
Latest checkpoint location: 0/1610CA0
Prior checkpoint location: 0/15F8028
Latest checkpoint's REDO location: 0/1610CA0
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/675
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 654
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Sun Feb 6 11:19:19 2011
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: minimal
Current max_connections setting: 40
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
--
Adam PAPAI
BSD Support Service
http://www.bsdsupportservice.hu
E-mail: adam.papai@bsdsupportservice.hu
Phone: +36 30 33-55-735 (Hungary)
Adam PAPAI wrote:
The correct encoding of the values are:
INSERT INTO test (name) values ( 'a' );
INSERT INTO test (name) values ( 'á' );
INSERT INTO test (name) values ( 'b' );
INSERT INTO test (name) values ( 'c' );
INSERT INTO test (name) values ( 'd' );
INSERT INTO test (name) values ( 'e' );
INSERT INTO test (name) values ( 'é' );
INSERT INTO test (name) values ( 'f' );
INSERT INTO test (name) values ( 'g' );
INSERT INTO test (name) values ( 'h' );
I forgot to change the encoding of my mail client.
test=# select * from test order by name asc;
idn | name
-----+------
1 | a
3 | b
4 | c
5 | d
6 | e
8 | f
9 | g
10 | h
2 | á
7 | é
(10 rows)
--
Adam PAPAI
BSD Support Service
http://www.bsdsupportservice.hu
E-mail: adam.papai@bsdsupportservice.hu
Phone: +36 30 33-55-735 (Hungary)
Adam PAPAI <adam.papai@bsdsupportservice.hu> writes:
[ hu_HU.UTF-8 locale doesn't sort as expected on ]
FreeBSD titanium 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
UTC 2010 root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Do you get the expected sort ordering from the "sort" command either?
I'm betting not. This is not something Postgres controls -- we just
rely on the results of strcoll().
I don't know the state of affairs on FreeBSD for sure, but I do know
that the UTF-8 locales are fairly broken on Mac OS X, which is a BSD
derived platform too. It looks like nobody in BSD-land has bothered to
put in the work to make multibyte characters sort properly.
You might have better luck with hu_HU using a suitable LATINn character
set instead of UTF8. If that still doesn't do what you want, you might
have no choice but to move off BSD :-(
regards, tom lane
Tom Lane wrote:
Adam PAPAI <adam.papai@bsdsupportservice.hu> writes:
[ hu_HU.UTF-8 locale doesn't sort as expected on ]
FreeBSD titanium 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
UTC 2010 root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64Do you get the expected sort ordering from the "sort" command either?
I'm betting not. This is not something Postgres controls -- we just
rely on the results of strcoll().I don't know the state of affairs on FreeBSD for sure, but I do know
that the UTF-8 locales are fairly broken on Mac OS X, which is a BSD
derived platform too. It looks like nobody in BSD-land has bothered to
put in the work to make multibyte characters sort properly.You might have better luck with hu_HU using a suitable LATINn character
set instead of UTF8. If that still doesn't do what you want, you might
have no choice but to move off BSD :-(regards, tom lane
Hi Tom,
I've tested the "sort" command. It fails too.
Perhaps it's really with the *BSD and not with PostgreSQL.
Thanks,
--
Adam PAPAI
BSD Support Service
http://www.bsdsupportservice.hu
E-mail: adam.papai@bsdsupportservice.hu
Phone: +36 30 33-55-735 (Hungary)
On sön, 2011-02-06 at 11:06 -0500, Tom Lane wrote:
I don't know the state of affairs on FreeBSD for sure, but I do know
that the UTF-8 locales are fairly broken on Mac OS X, which is a BSD
derived platform too. It looks like nobody in BSD-land has bothered
to put in the work to make multibyte characters sort properly.
Yeah, same on FreeBSD. Using LATINn is the right choice there.
Adam PAPAI <adam.papai@bsdsupportservice.hu> writes:
FreeBSD titanium 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
UTC 2010 root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64Tested PgSQL versions are: 8.4 and 9.0.3 (fresh install using ports)
Did you try ICU patch? IMHO it can help with your problem. But i am not sure
how it is stable.
# cd /usr/ports/databases/postgresql84-server
# make config
┌────────────────────────────────────────────────────────────────────┐
│ Options for postgresql-server 8.4.5_2 │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │[X] NLS Use internationalized messages │ │
│ │[ ] PAM Build with PAM support (server only) │ │
│ │[ ] LDAP Build with LDAP authentication support │ │
│ │[ ] MIT_KRB5 Build with MIT's kerberos support │ │
│ │[ ] HEIMDAL_KRB5 Builds with Heimdal kerberos support │ │
│ │[ ] OPTIMIZED_CFLAGS Builds with compiler optimizations (-O3) │ │
│ │[X] XML Build with XML data type (server) │ │
│ │[X] TZDATA Use internal timezone database (server) │ │
│ │[ ] DEBUG Builds with debugging symbols │ │
│ │[X] GSSAPI BUild with GSSAPI support │ │
==>│ │[ ] ICU Use ICU for unicode collation (server) │ │
│ │[X] INTDATE Builds with 64-bit date/time type (server)│ │
│ │ │ │
│ │ │ │
│ │ │ │
├─└────────────────────────────────────────────────────────────────┘─┤
│ [ OK ] Cancel │
└────────────────────────────────────────────────────────────────────┘
--
Sergey Burladyan