Sorting regression of text function result since commit 586b98fdf1aae
Hi,
A customer found what looks like a sort regression while testing his code from
v11 on a higher version. We hunt this regression down to commit 586b98fdf1aae,
introduced in v12.
Consider the following test case:
createdb -l fr_FR.utf8 -T template0 reg
psql reg <<<"
BEGIN;
CREATE TABLE IF NOT EXISTS reg
(
id bigint NOT NULL,
reg bytea NOT NULL
);
INSERT INTO reg VALUES
(1, convert_to( 'aaa', 'UTF8')),
(2, convert_to( 'aa}', 'UTF8'));
SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');"
In parent commit 68f6f2b7395fe, it results:
id
────
2
1
And in 586b98fdf1aae:
id
────
1
2
Looking at the plan, the sort node are different:
* 68f6f2b7395fe: Sort Key: (convert_from(reg, 'UTF8'::name))
* 586b98fdf1aae: Sort Key: (convert_from(reg, 'UTF8'::name)) COLLATE "C"
It looks like since 586b98fdf1aae, the result type collation of "convert_from"
is forced to "C", like the patch does for type "name", instead of the "default"
collation for type "text".
Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follow the input collation
in this case. With 586b98fdf1aae:
-- 2nd parameter type resolved as "name" so collation forced to "C"
SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');
-- 1
-- 2
-- Collation of 2nd parameter is forced to something else
SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8' COLLATE \"default\");
-- 2
-- 1
-- Sort
-- Sort Key: (convert_from(reg, 'UTF8'::name COLLATE "default"))
-- -> Seq Scan on reg
It seems because the second parameter type is "name", the result collation
become "C" instead of being the collation associated with "text" type:
"default".
I couldn't find anything explaining this behavior in the changelog. It looks
like a regression to me, but if this is actually expected, maybe this deserve
some documentation patch?
Regards,
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
It looks like since 586b98fdf1aae, the result type collation of "convert_from"
is forced to "C", like the patch does for type "name", instead of the "default"
collation for type "text".
Well, convert_from() inherits its result collation from the input,
per the normal rules for collation assignment [1]https://www.postgresql.org/docs/current/collation.html#COLLATION-CONCEPTS.
Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follow the input collation
in this case.
It's not "wrong", it's what the SQL standard requires.
I couldn't find anything explaining this behavior in the changelog. It looks
like a regression to me, but if this is actually expected, maybe this deserve
some documentation patch?
The v12 release notes do say
Type name now behaves much like a domain over type text that has
default collation “C”.
You'd have similar results from an expression involving such a domain,
I believe.
I'm less than excited about patching the v12 release notes four
years later. Maybe, if this point had come up in a more timely
fashion, we'd have mentioned it --- but it's hardly possible to
cover every potential implication of such a change in the
release notes.
regards, tom lane
[1]: https://www.postgresql.org/docs/current/collation.html#COLLATION-CONCEPTS
On Mon, 11 Dec 2023 15:43:12 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
It looks like since 586b98fdf1aae, the result type collation of
"convert_from" is forced to "C", like the patch does for type "name",
instead of the "default" collation for type "text".Well, convert_from() inherits its result collation from the input,
per the normal rules for collation assignment [1].Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follow the input
collation in this case.It's not "wrong", it's what the SQL standard requires.
Mh, OK. This is at least a surprising behavior. Having a non-data related
argument impacting the result collation seems counter-intuitive. But I
understand this is by standard, no need to discuss it.
I couldn't find anything explaining this behavior in the changelog. It looks
like a regression to me, but if this is actually expected, maybe this
deserve some documentation patch?The v12 release notes do say
Type name now behaves much like a domain over type text that has
default collation “C”.
Sure, and I saw it, but reading at this entry, I couldn't guess this could have
such implication on text result from a function call. That's why I hunt for
the precise commit and was surprise to find this was the actual change.
You'd have similar results from an expression involving such a domain,
I believe.I'm less than excited about patching the v12 release notes four
years later. Maybe, if this point had come up in a more timely
fashion, we'd have mentioned it --- but it's hardly possible to
cover every potential implication of such a change in the
release notes.
This could have been documented in the collation concept page, as a trap to be
aware of. A link from the release note to such a small paragraph would have
been enough to warn devs this might have implications when mixed with other
collatable types. But I understand we can not document all the traps paving the
way to the standard anyway.
Thank you for your explanation!
Regards,