Nondeterministic collations and the value returned by GROUP BY x

Started by Jim Finnertyalmost 5 years ago4 messages
#1Jim Finnerty
jfinnert@amazon.com

PostgreSQL 12 and onward supports nondeterministic collations. For "GROUP
BY x", which value of 'x' will PostgreSQL return in this case? The first
value of x?

The SQL standard (section 8.2) states that the specific value returned is
implementation-defined, but requires that the value must be one of the
specific values in the set of values that compare equally:

d) Depending on the collation, two strings may compare as equal even if they
are of different lengths or contain different sequences of characters. When
any of the operations MAX, MIN, and DISTINCT reference a grouping column,
and the UNION, EXCEPT, and INTERSECT operators refer to character strings,
*the specific value selected by these operations from a set of such equal
values is implementation- dependent*.

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Finnerty (#1)
Re: Nondeterministic collations and the value returned by GROUP BY x

Jim Finnerty <jfinnert@amazon.com> writes:

PostgreSQL 12 and onward supports nondeterministic collations. For "GROUP
BY x", which value of 'x' will PostgreSQL return in this case? The first
value of x?

The SQL standard (section 8.2) states that the specific value returned is
implementation-defined, but requires that the value must be one of the
specific values in the set of values that compare equally:

d) Depending on the collation, two strings may compare as equal even if they
are of different lengths or contain different sequences of characters. When
any of the operations MAX, MIN, and DISTINCT reference a grouping column,
and the UNION, EXCEPT, and INTERSECT operators refer to character strings,
*the specific value selected by these operations from a set of such equal
values is implementation- dependent*.

As I recall, "implementation-dependent" means specifically that we *don't*
have to make any promise about which particular value will be selected.
If it said "implementation-defined" then we would.

I expect that in practice it'd be the first of the group that arrives at
the grouping plan node --- but that doesn't really get you any closer
to being able to say which one it is exactly. The input is either not
ordered at all, or ordered by something like a Sort node, which itself
is not going to make any promises about which one of a group of peers
is delivered first.

regards, tom lane

#3Jim Finnerty
jfinnert@amazon.com
In reply to: Tom Lane (#2)
Re: Nondeterministic collations and the value returned by GROUP BY x

right. It doesn't matter which of the values is returned; however, a
plausible-sounding implementation would case-fold the value, like GROUP BY
LOWER(x), but the case-folded value isn't necessarily one of the original
values and so that could be subtly wrong in the case-insensitive case, and
could in principle be completely wrong in the most general nondeterministic
collation case where the case-folded value isn't even equal to the other
members of the set.

does the implementation in PG12 ensure that some member of the set of equal
values is chosen as the representative value?

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Finnerty (#3)
Re: Nondeterministic collations and the value returned by GROUP BY x

Jim Finnerty <jfinnert@amazon.com> writes:

right. It doesn't matter which of the values is returned; however, a
plausible-sounding implementation would case-fold the value, like GROUP BY
LOWER(x), but the case-folded value isn't necessarily one of the original
values and so that could be subtly wrong in the case-insensitive case, and
could in principle be completely wrong in the most general nondeterministic
collation case where the case-folded value isn't even equal to the other
members of the set.

does the implementation in PG12 ensure that some member of the set of equal
values is chosen as the representative value?

Without having actually looked, I'm pretty certain it does.
Considerations of data type independence would seem to rule out a hack
like applying case folding. There might be case folding happening
internally to comparison functions, like citext_cmp, but that wouldn't
affect the grouping logic that is going to save aside one of the
group of peer values.

regards, tom lane