BUG #15651: Collation setting en_US.utf8 breaking sort order
The following bug has been logged on the website:
Bug reference: 15651
Logged by: Kaleb Akalework
Email address: kaleb.akalework@asg.com
PostgreSQL version: 11.0
Operating system: Linux
Description:
I have PostgresSQL database on Windows. I created database with Collation of
en_US.utf8. Then I created table (The steps to reproduce are below). I
inserted a few rows into this table one of which was row with special
characters "~!@#$^&(". The insert worked fine but then when I do a select on
the column for values >=' ' (Space), I get back all the rows except for the
row that contains
"~!@#$^&(" . In the UTF8 table
https://www.utf8-chartable.de/unicode-utf8-table.pl, I can see that space is
the lowest printable character so technically every printable character
showed be greater than space but it isn't in this case. I create another
database and set collation to 'C'. then the same select query returns
"~!@#$^&(". The problem is only apparent in a Linux environment where
postgreSQL is running and the database has a collation of en_US.utf8. Can
someone help please?
create table test (
name_c varchar(14)
)
insert into test (name_c) values ('AAA')
insert into test (name_c) values ('BAA')
insert into test (name_c) values ('CAA')
insert into test (name_c) values ('DAA')
insert into test (name_c) values ('~!@#$^&(')
select * from test where name_c >= ' '
PG Bug reporting form <noreply@postgresql.org> writes:
I have PostgresSQL database on Windows. I created database with Collation of
en_US.utf8.
Really? AFAIK, Windows doesn't support collation names that look like
that.
Then I created table (The steps to reproduce are below). I
inserted a few rows into this table one of which was row with special
characters "~!@#$^&(". The insert worked fine but then when I do a select on
the column for values >=' ' (Space), I get back all the rows except for the
row that contains
"~!@#$^&(" .
This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:
$ LANG=C sort stuff.txt
AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt
~!@#$^&(
AAA
BAA
CAA
DAA
(The first line in my test file contains one space.)
regards, tom lane
--> Really? AFAIK, Windows doesn't support collation names that look like that.
I meant to say Linux. Not Windows.
-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, February 22, 2019 1:03 PM
To: Kaleb Akalework <kaleb.akalework@asg.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order
*** External email: Verify sender before opening attachments or links ***
PG Bug reporting form <noreply@postgresql.org> writes:
I have PostgresSQL database on Windows. I created database with
Collation of en_US.utf8.
Really? AFAIK, Windows doesn't support collation names that look like that.
Then I created table (The steps to reproduce are below). I inserted a
few rows into this table one of which was row with special characters
"~!@#$^&(". The insert worked fine but then when I do a select on the
column for values >=' ' (Space), I get back all the rows except for
the row that contains "~!@#$^&(" .
This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:
$ LANG=C sort stuff.txt
AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt
~!@#$^&(
AAA
BAA
CAA
DAA
(The first line in my test file contains one space.)
regards, tom lane
On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This appears to be the intended behavior of en_US sorting.
Right. UCA style algorithms tend to give the least weight of all to
whitespace characters. They're tertiary weight, if memory serves.
--
Peter Geoghegan
Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation setting of C, does it mean I won't be able to support multiple languages?
-----Original Message-----
From: Peter Geoghegan <pg@bowt.ie>
Sent: Friday, February 22, 2019 1:41 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Kaleb Akalework <kaleb.akalework@asg.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order
*** External email: Verify sender before opening attachments or links ***
On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This appears to be the intended behavior of en_US sorting.
Right. UCA style algorithms tend to give the least weight of all to whitespace characters. They're tertiary weight, if memory serves.
--
Peter Geoghegan
Kaleb Akalework <kaleb.akalework@asg.com> writes:
Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation setting of C, does it mean I won't be able to support multiple languages?
Collation only determines sort order, I believe.
regards, tom lane
On 2019-02-22 19:03, Tom Lane wrote:
$ LANG=en_US sort stuff.txt
~!@#$^&(AAA
BAA
CAA
DAA
With ICU (COLLATE "und-x-icu"), I get the line with the space first. I
took a bit of a look around the various Unicode documents and I don't
find anything that would defend the glibc behavior.
<obscure detail>
However, since some of those special characters are variable collating
elements and some are not, there might well be an explanation.
</obscure detail>
So, maybe try ICU.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote:
With ICU (COLLATE "und-x-icu"), I get the line with the space first. I
took a bit of a look around the various Unicode documents and I don't
find anything that would defend the glibc behavior.
The glibc result is now version-dependent. With glibc-2.28,
on Debian buster:
buster$ LC_COLLATE=en_US.utf8 sort stuff.txt
~!@#$^&(
AAA
BAA
CAA
buster$ apt-cache show libc-bin | grep Version
Version: 2.28-7
As opposed to the current Debian stable, with glibc-2.24:
stretch$ LC_COLLATE=en_US.utf8 sort stuff.txt
~!@#$^&(
AAA
BAA
CAA
DAA
stretch$ apt-cache show libc-bin|grep Version
Version: 2.24-11+deb9u3
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Kaleb Akalework wrote:
Ok so if this is intended behavior of UTF8 then I understand. My last
question then would be if I use a collation setting of C, does it mean I
won't be able to support multiple languages?
You seem to want to the sort order of C, but be aware that you might
have to decide whether you want this:
=> select upper('é' collate "C");
upper
-------
é
(1 row)
or that:
=> select upper('é' collate "en_US");
upper
-------
É
(1 row)
To get the sort order of C but the interpretation of characters closer
to what you'd expect from Unicode, it's possible for the database
to have LC_COLLATE to "C", and LC_CTYPE to, say, en_US.UTF-8.
See CREATE DATABASE.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite