Sorting regression of text function result since commit 586b98fdf1aae

Started by Jehan-Guillaume de Rorthaisabout 2 years ago3 messages

Hi,

A customer found what looks like a sort regression while testing his code from
v11 on a higher version. We hunt this regression down to commit 586b98fdf1aae,
introduced in v12.

Consider the following test case:

createdb -l fr_FR.utf8 -T template0 reg
psql reg <<<"
BEGIN;
CREATE TABLE IF NOT EXISTS reg
(
id bigint NOT NULL,
reg bytea NOT NULL
);

INSERT INTO reg VALUES
(1, convert_to( 'aaa', 'UTF8')),
(2, convert_to( 'aa}', 'UTF8'));

SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');"

In parent commit 68f6f2b7395fe, it results:

id
────
2
1

And in 586b98fdf1aae:

id
────
1
2

Looking at the plan, the sort node are different:

* 68f6f2b7395fe: Sort Key: (convert_from(reg, 'UTF8'::name))
* 586b98fdf1aae: Sort Key: (convert_from(reg, 'UTF8'::name)) COLLATE "C"

It looks like since 586b98fdf1aae, the result type collation of "convert_from"
is forced to "C", like the patch does for type "name", instead of the "default"
collation for type "text".

Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follow the input collation
in this case. With 586b98fdf1aae:

-- 2nd parameter type resolved as "name" so collation forced to "C"
SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');
-- 1
-- 2

-- Collation of 2nd parameter is forced to something else
SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8' COLLATE \"default\");
-- 2
-- 1
-- Sort
-- Sort Key: (convert_from(reg, 'UTF8'::name COLLATE "default"))
-- -> Seq Scan on reg

It seems because the second parameter type is "name", the result collation
become "C" instead of being the collation associated with "text" type:
"default".

I couldn't find anything explaining this behavior in the changelog. It looks
like a regression to me, but if this is actually expected, maybe this deserve
some documentation patch?

Regards,

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: Sorting regression of text function result since commit 586b98fdf1aae

Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:

It looks like since 586b98fdf1aae, the result type collation of "convert_from"
is forced to "C", like the patch does for type "name", instead of the "default"
collation for type "text".

Well, convert_from() inherits its result collation from the input,
per the normal rules for collation assignment [1]https://www.postgresql.org/docs/current/collation.html#COLLATION-CONCEPTS.

Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follow the input collation
in this case.

It's not "wrong", it's what the SQL standard requires.

I couldn't find anything explaining this behavior in the changelog. It looks
like a regression to me, but if this is actually expected, maybe this deserve
some documentation patch?

The v12 release notes do say

Type name now behaves much like a domain over type text that has
default collation “C”.

You'd have similar results from an expression involving such a domain,
I believe.

I'm less than excited about patching the v12 release notes four
years later. Maybe, if this point had come up in a more timely
fashion, we'd have mentioned it --- but it's hardly possible to
cover every potential implication of such a change in the
release notes.

regards, tom lane

[1]: https://www.postgresql.org/docs/current/collation.html#COLLATION-CONCEPTS

In reply to: Tom Lane (#2)
Re: Sorting regression of text function result since commit 586b98fdf1aae

On Mon, 11 Dec 2023 15:43:12 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:

It looks like since 586b98fdf1aae, the result type collation of
"convert_from" is forced to "C", like the patch does for type "name",
instead of the "default" collation for type "text".

Well, convert_from() inherits its result collation from the input,
per the normal rules for collation assignment [1].

Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follow the input
collation in this case.

It's not "wrong", it's what the SQL standard requires.

Mh, OK. This is at least a surprising behavior. Having a non-data related
argument impacting the result collation seems counter-intuitive. But I
understand this is by standard, no need to discuss it.

I couldn't find anything explaining this behavior in the changelog. It looks
like a regression to me, but if this is actually expected, maybe this
deserve some documentation patch?

The v12 release notes do say

Type name now behaves much like a domain over type text that has
default collation “C”.

Sure, and I saw it, but reading at this entry, I couldn't guess this could have
such implication on text result from a function call. That's why I hunt for
the precise commit and was surprise to find this was the actual change.

You'd have similar results from an expression involving such a domain,
I believe.

I'm less than excited about patching the v12 release notes four
years later. Maybe, if this point had come up in a more timely
fashion, we'd have mentioned it --- but it's hardly possible to
cover every potential implication of such a change in the
release notes.

This could have been documented in the collation concept page, as a trap to be
aware of. A link from the release note to such a small paragraph would have
been enough to warn devs this might have implications when mixed with other
collatable types. But I understand we can not document all the traps paving the
way to the standard anyway.

Thank you for your explanation!

Regards,