BUG #15651: Collation setting en_US.utf8 breaking sort order

Started by PG Bug reporting formover 7 years ago9 messagesbugs

noreply@postgresql.org

over 7 years ago

The following bug has been logged on the website:

Bug reference: 15651
Logged by: Kaleb Akalework
Email address: kaleb.akalework@asg.com
PostgreSQL version: 11.0
Operating system: Linux
Description:

I have PostgresSQL database on Windows. I created database with Collation of
en_US.utf8. Then I created table (The steps to reproduce are below). I
inserted a few rows into this table one of which was row with special
characters "~!@#$^&(". The insert worked fine but then when I do a select on
the column for values >=' ' (Space), I get back all the rows except for the
row that contains
"~!@#$^&(" . In the UTF8 table
https://www.utf8-chartable.de/unicode-utf8-table.pl, I can see that space is
the lowest printable character so technically every printable character
showed be greater than space but it isn't in this case. I create another
database and set collation to 'C'. then the same select query returns
"~!@#$^&(". The problem is only apparent in a Linux environment where
postgreSQL is running and the database has a collation of en_US.utf8. Can
someone help please?

create table test (
name_c varchar(14)
)

insert into test (name_c) values ('AAA')
insert into test (name_c) values ('BAA')
insert into test (name_c) values ('CAA')
insert into test (name_c) values ('DAA')
insert into test (name_c) values ('~!@#$^&(')

select * from test where name_c >= ' '

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

PG Bug reporting form <noreply@postgresql.org> writes:

I have PostgresSQL database on Windows. I created database with Collation of
en_US.utf8.

Really? AFAIK, Windows doesn't support collation names that look like
that.

Then I created table (The steps to reproduce are below). I
inserted a few rows into this table one of which was row with special
characters "~!@#$^&(". The insert worked fine but then when I do a select on
the column for values >=' ' (Space), I get back all the rows except for the
row that contains
"~!@#$^&(" .

This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:

$ LANG=C sort stuff.txt

AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt
~!@#$^&(

AAA
BAA
CAA
DAA

(The first line in my test file contains one space.)

regards, tom lane

Kaleb Akalework

kaleb.akalework@asg.com

over 7 years ago

In reply to: Tom Lane (#2)

RE: BUG #15651: Collation setting en_US.utf8 breaking sort order

--> Really? AFAIK, Windows doesn't support collation names that look like that.
I meant to say Linux. Not Windows.

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, February 22, 2019 1:03 PM
To: Kaleb Akalework <kaleb.akalework@asg.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

*** External email: Verify sender before opening attachments or links ***

PG Bug reporting form <noreply@postgresql.org> writes:

I have PostgresSQL database on Windows. I created database with
Collation of en_US.utf8.

Really? AFAIK, Windows doesn't support collation names that look like that.

Then I created table (The steps to reproduce are below). I inserted a
few rows into this table one of which was row with special characters
"~!@#$^&(". The insert worked fine but then when I do a select on the
column for values >=' ' (Space), I get back all the rows except for
the row that contains "~!@#$^&(" .

This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:

$ LANG=C sort stuff.txt

AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt
~!@#$^&(

AAA
BAA
CAA
DAA

(The first line in my test file contains one space.)

regards, tom lane

Peter Geoghegan

pg@bowt.ie

over 7 years ago

In reply to: Tom Lane (#2)

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This appears to be the intended behavior of en_US sorting.

Right. UCA style algorithms tend to give the least weight of all to
whitespace characters. They're tertiary weight, if memory serves.

--
Peter Geoghegan

Kaleb Akalework

kaleb.akalework@asg.com

over 7 years ago

In reply to: Peter Geoghegan (#4)

RE: BUG #15651: Collation setting en_US.utf8 breaking sort order

Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation setting of C, does it mean I won't be able to support multiple languages?

-----Original Message-----
From: Peter Geoghegan <pg@bowt.ie>
Sent: Friday, February 22, 2019 1:41 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Kaleb Akalework <kaleb.akalework@asg.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

*** External email: Verify sender before opening attachments or links ***

On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This appears to be the intended behavior of en_US sorting.

Right. UCA style algorithms tend to give the least weight of all to whitespace characters. They're tertiary weight, if memory serves.

--
Peter Geoghegan

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Kaleb Akalework (#5)

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

Kaleb Akalework <kaleb.akalework@asg.com> writes:

Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation setting of C, does it mean I won't be able to support multiple languages?

Collation only determines sort order, I believe.

regards, tom lane

Peter Eisentraut

peter_e@gmx.net

over 7 years ago

In reply to: Tom Lane (#2)

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

On 2019-02-22 19:03, Tom Lane wrote:

$ LANG=en_US sort stuff.txt
~!@#$^&(

AAA
BAA
CAA
DAA

With ICU (COLLATE "und-x-icu"), I get the line with the space first. I
took a bit of a look around the various Unicode documents and I don't
find anything that would defend the glibc behavior.

<obscure detail>
However, since some of those special characters are variable collating
elements and some are not, there might well be an explanation.
</obscure detail>

So, maybe try ICU.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Daniel Verite

daniel@manitou-mail.org

over 7 years ago

In reply to: Peter Eisentraut (#7)

Re: BUG #15651: Collation setting en_US.utf8 breaking sort order

Peter Eisentraut wrote:

With ICU (COLLATE "und-x-icu"), I get the line with the space first. I
took a bit of a look around the various Unicode documents and I don't
find anything that would defend the glibc behavior.

The glibc result is now version-dependent. With glibc-2.28,
on Debian buster:

buster$ LC_COLLATE=en_US.utf8 sort stuff.txt

~!@#$^&(
AAA
BAA
CAA

buster$ apt-cache show libc-bin | grep Version
Version: 2.28-7

As opposed to the current Debian stable, with glibc-2.24:

stretch$ LC_COLLATE=en_US.utf8 sort stuff.txt
~!@#$^&(

AAA
BAA
CAA
DAA

stretch$ apt-cache show libc-bin|grep Version
Version: 2.24-11+deb9u3

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Daniel Verite

daniel@manitou-mail.org

over 7 years ago

In reply to: Kaleb Akalework (#5)

RE: BUG #15651: Collation setting en_US.utf8 breaking sort order

Kaleb Akalework wrote:

Ok so if this is intended behavior of UTF8 then I understand. My last
question then would be if I use a collation setting of C, does it mean I
won't be able to support multiple languages?

You seem to want to the sort order of C, but be aware that you might
have to decide whether you want this:

=> select upper('é' collate "C");
upper
-------
é
(1 row)

or that:

=> select upper('é' collate "en_US");
upper
-------
É
(1 row)

To get the sort order of C but the interpretation of characters closer
to what you'd expect from Unicode, it's possible for the database
to have LC_COLLATE to "C", and LC_CTYPE to, say, en_US.UTF-8.
See CREATE DATABASE.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite