Very puzzling sort behavior

Started by Ken Tanzerover 10 years ago13 messagesgeneral
Jump to latest
#1Ken Tanzer
ken.tanzer@gmail.com

Hi. In a table that includes these columns:

my_db=> \d tbl_client
...
name_last | character varying(40) | not null
name_first | character varying(30) | not null
...

I am extremely puzzled by the sorting of the "CLARKE"s in this list:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS
clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE
'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', ' ||
name_first;
name_last | length | clark | clarke
-----------+--------+-------+--------
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMSON | 7 | f | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARKE | 6 | f | t
CLARKE | 6 | f | t
CLARKE | 6 | f | t
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
(17 rows)

The ADAMS are included just to show a similar example is ordering
correctly. I put the length and equality test columns in to try to make
sure there weren't some bizarre characters in the data. This is only
happening on one particular database. I did a reindex on the table just
for good measure. If I remove the name_first piece of the ORDER BY (which
doesn't seem like it should matter), it sorts as expected:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS
clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE
'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', '

;
name_last | length | clark | clarke
-----------+--------+-------+--------
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMSON | 7 | f | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARKE | 6 | f | t
CLARKE | 6 | f | t
CLARKE | 6 | f | t
(17 rows)

I tried selecting those 17 rows from tbl_client into a new table, and get
the same odd behavior. However, if I run with new data I get an expected
order:

CREATE TEMP TABLE test (
name_first VARCHAR(40),
name_last VARCHAR(30)

);

INSERT INTO test VALUES ('JOE','CLARKE'),('BILL','CLARK');

SELECT * FROM test ORDER BY name_last;
SELECT * FROM test ORDER BY name_last || ', ' || name_first;

Any thoughts about what's going on, what to do about it, or what obvious
point I missing? Thanks in advance!

my_db=> SELECT version();
version

----------------------------------------------------------------------------------------------------------------
PostgreSQL 9.2.13 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-11), 64-bit
(1 row)

Ken

--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/&gt;*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client&gt;*
ken.tanzer@agency-software.org
(253) 245-3801

Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.

In reply to: Ken Tanzer (#1)
Re: Very puzzling sort behavior

On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:

Any thoughts about what's going on, what to do about it, or what obvious point I missing? Thanks in advance!

This is the expected behavior. Locale rules will weigh the punctuation
character and space you added after primary alphabetical ordering.
Therefore, the 'E' in 'CLARKE' is more or less compared with the first
character of the first name in the case that you show, because those
code points are both what Unicode calls "primary weights".

You're not displaying the string you're ordering by anyway, so you can
just not concatenate the ', ' string (so name_last || name_first), and
it will work as you expect, I believe.

--
Regards,
Peter Geoghegan

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Steve Crawford
scrawford@pinpointresearch.com
In reply to: Ken Tanzer (#1)
Re: Very puzzling sort behavior

Any null values in first name??

-Steve

On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:

Show quoted text

Hi. In a table that includes these columns:

my_db=> \d tbl_client
...
name_last | character varying(40) | not null
name_first | character varying(30) | not null
...

I am extremely puzzled by the sorting of the "CLARKE"s in this list:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS
clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE
'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', ' ||
name_first;
name_last | length | clark | clarke
-----------+--------+-------+--------
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMSON | 7 | f | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARKE | 6 | f | t
CLARKE | 6 | f | t
CLARKE | 6 | f | t
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
(17 rows)

The ADAMS are included just to show a similar example is ordering
correctly. I put the length and equality test columns in to try to make
sure there weren't some bizarre characters in the data. This is only
happening on one particular database. I did a reindex on the table just
for good measure. If I remove the name_first piece of the ORDER BY (which
doesn't seem like it should matter), it sorts as expected:

my_db=> SELECT name_last,length(name_last),name_last='CLARK' AS
clark,name_last='CLARKE' AS clarke FROM tbl_client WHERE name_last ILIKE
'CLARK%' OR name_last ILIKE 'ADAMS%' ORDER BY name_last || ', '

;
name_last | length | clark | clarke
-----------+--------+-------+--------
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMS | 5 | f | f
ADAMSON | 7 | f | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARK | 5 | t | f
CLARKE | 6 | f | t
CLARKE | 6 | f | t
CLARKE | 6 | f | t
(17 rows)

I tried selecting those 17 rows from tbl_client into a new table, and get
the same odd behavior. However, if I run with new data I get an expected
order:

CREATE TEMP TABLE test (
name_first VARCHAR(40),
name_last VARCHAR(30)

);

INSERT INTO test VALUES ('JOE','CLARKE'),('BILL','CLARK');

SELECT * FROM test ORDER BY name_last;
SELECT * FROM test ORDER BY name_last || ', ' || name_first;

Any thoughts about what's going on, what to do about it, or what obvious
point I missing? Thanks in advance!

my_db=> SELECT version();
version

----------------------------------------------------------------------------------------------------------------
PostgreSQL 9.2.13 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit
(1 row)

Ken

--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/&gt;*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client&gt;*
ken.tanzer@agency-software.org
(253) 245-3801

Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#2)
Re: Very puzzling sort behavior

Peter Geoghegan <peter.geoghegan86@gmail.com> writes:

On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:

Any thoughts about what's going on, what to do about it, or what obvious point I missing? Thanks in advance!

This is the expected behavior.

If you don't like it, sort in C locale ...

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Ken Tanzer
ken.tanzer@gmail.com
In reply to: Tom Lane (#4)
Re: Very puzzling sort behavior

On Thu, Sep 10, 2015 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <peter.geoghegan86@gmail.com> writes:

On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com>

wrote:

Any thoughts about what's going on, what to do about it, or what

obvious point I missing? Thanks in advance!

This is the expected behavior.

If you don't like it, sort in C locale ...

regards, tom lane

OK, can one of you help me out in understanding this? I would have thought
that given "CLARK," and "CLARKE" that the comma would get compared against
the E and come first. End of story, before we even get to anything farther
in the string. What am I missing?

--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/&gt;*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client&gt;*
ken.tanzer@agency-software.org
(253) 245-3801

Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.

#6Ken Tanzer
ken.tanzer@gmail.com
In reply to: Ken Tanzer (#5)
Re: Very puzzling sort behavior

Alright never mind, I guess I see what's going on. Thanks!

Ken

On Thu, Sep 10, 2015 at 12:51 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:

On Thu, Sep 10, 2015 at 12:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <peter.geoghegan86@gmail.com> writes:

On Thu, Sep 10, 2015 at 12:35 PM, Ken Tanzer <ken.tanzer@gmail.com>

wrote:

Any thoughts about what's going on, what to do about it, or what

obvious point I missing? Thanks in advance!

This is the expected behavior.

If you don't like it, sort in C locale ...

regards, tom lane

OK, can one of you help me out in understanding this? I would have
thought that given "CLARK," and "CLARKE" that the comma would get compared
against the E and come first. End of story, before we even get to anything
farther in the string. What am I missing?

--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/&gt;*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client&gt;*
ken.tanzer@agency-software.org
(253) 245-3801

Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.

--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/&gt;*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client&gt;*
ken.tanzer@agency-software.org
(253) 245-3801

Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.

In reply to: Ken Tanzer (#5)
Re: Very puzzling sort behavior

On Thu, Sep 10, 2015 at 12:51 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:

OK, can one of you help me out in understanding this? I would have thought that given "CLARK," and "CLARKE" that the comma would get compared against the E and come first. End of story, before we even get to anything farther in the string. What am I missing?

That's only how it works with the C locale. Otherwise, there are
complicated rules to weigh things like space and punctuation (and
accents/diacritics) less prominently than primary alphabetical
ordering. This is often useful. Anyway, based on what you say here, I
think you should actually "ORDER BY name_last, name_first".

--
Regards,
Peter Geoghegan

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Ken Tanzer
ken.tanzer@gmail.com
In reply to: Peter Geoghegan (#7)
Re: Very puzzling sort behavior

On Thu, Sep 10, 2015 at 12:56 PM, Peter Geoghegan <
peter.geoghegan86@gmail.com> wrote:

On Thu, Sep 10, 2015 at 12:51 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:

OK, can one of you help me out in understanding this? I would have

thought that given "CLARK," and "CLARKE" that the comma would get compared
against the E and come first. End of story, before we even get to anything
farther in the string. What am I missing?

That's only how it works with the C locale. Otherwise, there are
complicated rules to weigh things like space and punctuation (and
accents/diacritics) less prominently than primary alphabetical
ordering. This is often useful. Anyway, based on what you say here, I
think you should actually "ORDER BY name_last, name_first".

--
Regards,
Peter Geoghegan

Thanks. A little more help would be appreciated. First a little context:

What I mailed out what a boiled down example. In reality, what I have is a
ton of tables with a client_id in them, and a convenience function
client_name(client_id) that returns the name_last, name_first string (plus
an alias if it exists). client_name is used all over the place in both
views and in an app that uses the database. There is a similar, also
heavily used, staff_name function. Eliminating the use of these functions
is a non-starter for me--I'd much rather live with the existing sort
behavior, especially as no one has ever pointed this out despite over a
decade of use.

I'm hoping to sort change this behavior with as minimal a change as
possible (e.g., minimal potential for unexpected side effects or
breakage). I was hoping to just add a COLLATE "C" within the function:

CREATE OR REPLACE FUNCTION client_name( cid int4 ) RETURNS text AS $$
-- client is a view that draws from tbl_client. name_full is the
field with the name_last, name_first data in it
SELECT name_full COLLATE "C" FROM client WHERE client_id=$1;
$$

but that seems to have no effect. And sure enough the documentation seems
to back that up. ("The collation assigned to a function or operator's
combined input expressions is also considered to apply to the function or
operator's result, if the function or operator delivers a result of a
collatable data type.") So this may be wishful thinking, but is there any
other way to specify the collation of a function result? Specifying the
collation every time the function is used is likely a no-go for me too.

Alternatively, it seems I could create new databases with a C collation and
then move my data into them. This seems a bit drastic, although possible.
I'd again be worried about the breakage/side effects. And actually, will
this work? (i.e., can you use pg_dump to populate a new database with a
different locale?)

Are there any other potential solutions, pitfalls or considerations that
come to mind? Any thoughts welcome. And as I said, if there's not a good
way to do this I'll probably leave it alone. Thanks.

Ken

--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/&gt;*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client&gt;*
ken.tanzer@agency-software.org
(253) 245-3801

Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ken Tanzer (#8)
Re: Very puzzling sort behavior

Ken Tanzer wrote:

Are there any other potential solutions, pitfalls or considerations that
come to mind? Any thoughts welcome. And as I said, if there's not a good
way to do this I'll probably leave it alone.

In part, it boils down to what you use the in ORDER BY clause. If you
concatenate the last name and first name, they will be considered as a
single string and run afoul of funny behavior of dictionary sorting,
which ignores non-alphanumeric chars in the first pass. But if you keep
them separate by using "ORDER BY last_name, first_name" then sorting
will consider the last name separately from first name, and you'd get
the results you want (I think).

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Ken Tanzer
ken.tanzer@gmail.com
In reply to: Alvaro Herrera (#9)
Re: Very puzzling sort behavior

On Thu, Sep 10, 2015 at 2:02 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Ken Tanzer wrote:

Are there any other potential solutions, pitfalls or considerations that
come to mind? Any thoughts welcome. And as I said, if there's not a

good

way to do this I'll probably leave it alone.

In part, it boils down to what you use the in ORDER BY clause. If you
concatenate the last name and first name, they will be considered as a
single string and run afoul of funny behavior of dictionary sorting,
which ignores non-alphanumeric chars in the first pass. But if you keep
them separate by using "ORDER BY last_name, first_name" then sorting
will consider the last name separately from first name, and you'd get
the results you want (I think).

Thanks, but I guess I should have been clearer. Thanks to y'all wonderful
mailing list folks, I get it now as to why the two sorts are not the same.
I'm hoping for practical suggestions or advice about how to get C locale
sorting without having to rewrite all my existing queries. I still might
be clinging to futile hope, but is there really no way to specify a
collation for the return value of a function? And are there any pitfalls
associated with pg_dump-ing into a new database with a different collation?

Cheers,
Ken

--
AGENCY Software
A Free Software data system
By and for non-profits
*http://agency-software.org/ <http://agency-software.org/&gt;*
*https://agency-software.org/demo/client
<https://agency-software.org/demo/client&gt;*
ken.tanzer@agency-software.org
(253) 245-3801

Subscribe to the mailing list
<agency-general-request@lists.sourceforge.net?body=subscribe> to
learn more about AGENCY or
follow the discussion.

#11Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Ken Tanzer (#10)
Re: Very puzzling sort behavior

On Thu, Sep 10, 2015 at 02:54:31PM -0700, Ken Tanzer wrote:

Thanks, but I guess I should have been clearer. Thanks to y'all wonderful
mailing list folks, I get it now as to why the two sorts are not the same.
I'm hoping for practical suggestions or advice about how to get C locale
sorting without having to rewrite all my existing queries.

Why not change the collation for the column?
http://www.postgresql.org/docs/9.4/interactive/sql-altertable.html

be clinging to futile hope, but is there really no way to specify a
collation for the return value of a function?

I don't believe so. I think you need to specify the collation for the
data itself.

It strikes me that you might be able to similate this with a
materialized view or something like that, which has a different
collation than the source table. That seems like it'd be pretty
awkward, but if there's some reason you can't use C collation on the
source table itself that might work. I guess you could do the same
thing with a temporary table inside the function, thereby getting a
different collation on the data than in the source table. These both
seem like hacks, but if you need a bugfix prior to a real solution
they'd give you a path.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Joe Conway
mail@joeconway.com
In reply to: Andrew Sullivan (#11)
Re: Very puzzling sort behavior

On 09/10/2015 03:03 PM, Andrew Sullivan wrote:

On Thu, Sep 10, 2015 at 02:54:31PM -0700, Ken Tanzer wrote:

Thanks, but I guess I should have been clearer. Thanks to y'all wonderful
mailing list folks, I get it now as to why the two sorts are not the same.
I'm hoping for practical suggestions or advice about how to get C locale
sorting without having to rewrite all my existing queries.

Why not change the collation for the column?
http://www.postgresql.org/docs/9.4/interactive/sql-altertable.html

be clinging to futile hope, but is there really no way to specify a
collation for the return value of a function?

I don't believe so. I think you need to specify the collation for the
data itself.

I have not read the entire thread, and not sure I am following the issue
correctly, but isn't this what you want?

8<---------------
create table t2(a text);
insert into t2 values('abc,def'),('abcd');

show lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

select a from t2 order by a;
a
---------
abcd
abc,def
(2 rows)

select a from t2 order by a COLLATE "C";
a
---------
abc,def
abcd
(2 rows)
8<---------------

HTH,

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#13Ken Tanzer
ken.tanzer@gmail.com
In reply to: Andrew Sullivan (#11)
Re: Very puzzling sort behavior

On Thu, Sep 10, 2015 at 3:03 PM, Andrew Sullivan <ajs@crankycanuck.ca>
wrote:

On Thu, Sep 10, 2015 at 02:54:31PM -0700, Ken Tanzer wrote:

Thanks, but I guess I should have been clearer. Thanks to y'all

wonderful

mailing list folks, I get it now as to why the two sorts are not the

same.

I'm hoping for practical suggestions or advice about how to get C locale
sorting without having to rewrite all my existing queries.

Why not change the collation for the column?
http://www.postgresql.org/docs/9.4/interactive/sql-altertable.html

Hmm... that sounds promising. And less drastic than rewriting the whole
database. I'll have to ponder that one, and thanks for the suggestion!

Cheers,
Ken