Contradictory behavior of array_agg(distinct) aggregate.
Hi hackers!
Is it only me who consider that current behavior of array_agg(distinct)
contradicts to interpretation of nulls in other cases ("null" is
something like "unknown" which means that we can not say weather two
nulls are the same or not). This is why it is allowed to insert multiple
nulls in the unique column.
postgres=# create table t(x integer unique);
CREATE TABLE
postgres=# insert into t values (null),(null);
INSERT 0 2
postgres=# select count(distinct x) from t;
count
-------
0
(1 row)
postgres=# select array_agg(distinct x) from t;
array_agg
-----------
{NULL}
(1 row)
postgres=# select array_agg(x) from t;
array_agg
-------------
{NULL,NULL}
(1 row)
So what is the number of distinct "x" values in this case? I think that
according to SQL model - 0 (as count(distinct) returns).
Why in this case array_agg(distinct x) returns non-empty array?
Yes, unlike most other aggregates, `array_agg` is not ignoring null values.
But is it correct to treat two null values as the same (non-distinct)?
IMHO correct result in this case should be either {} or NULL, either
{NULL,NULL}.
Konstantin Knizhnik <knizhnik@garret.ru> writes:
postgres=# create table t(x integer unique);
CREATE TABLE
postgres=# insert into t values (null),(null);
INSERT 0 2
postgres=# select count(distinct x) from t;
count
-------
0
(1 row)
postgres=# select array_agg(distinct x) from t;
array_agg
-----------
{NULL}
(1 row)
postgres=# select array_agg(x) from t;
array_agg
-------------
{NULL,NULL}
(1 row)
I see nothing contradictory here. "array_agg(distinct x)"
combines the two NULLs into one, which is the normal
behavior of DISTINCT. "count(distinct x)" does the same
thing --- but count() only counts non-null inputs, so
you end with zero.
regards, tom lane
On 04/12/2024 9:03 am, Tom Lane wrote:
Konstantin Knizhnik <knizhnik@garret.ru> writes:
postgres=# create table t(x integer unique);
CREATE TABLE
postgres=# insert into t values (null),(null);
INSERT 0 2
postgres=# select count(distinct x) from t;
count
-------
0
(1 row)
postgres=# select array_agg(distinct x) from t;
array_agg
-----------
{NULL}
(1 row)
postgres=# select array_agg(x) from t;
array_agg
-------------
{NULL,NULL}
(1 row)I see nothing contradictory here. "array_agg(distinct x)"
combines the two NULLs into one, which is the normal
behavior of DISTINCT.
Sorry.
It is actually inconsistency in basic SQL model in interpretation of
NULL by UNIQUE and DISTINCT, and array_agg just follows this rule.
On Tuesday, December 3, 2024, Konstantin Knizhnik <knizhnik@garret.ru>
wrote:
Is it only me who consider that current behavior of array_agg(distinct)
contradicts to interpretation of nulls in other cases ("null" is something
like "unknown" which means that we can not say weather two nulls are the
same or not).
Null value handling has a few varied behaviors related to it. This
particular one is best thought of as desirably being consistent with group
by.
This is why it is allowed to insert multiple nulls in the unique column.
Nowadays the DBA gets to choose which of the two behaviors a unique index
applies, which allows indexes to get on the same page as group by et al.,
thus fixing your inconsistency claim here.
postgres=# create table t(x integer unique);
CREATE TABLE
postgres=# insert into t values (null),(null);
INSERT 0 2
postgres=# select count(distinct x) from t;
count
-------
0
(1 row)
postgres=# select array_agg(distinct x) from t;
array_agg
-----------
{NULL}
(1 row)postgres=# select array_agg(x) from t;
array_agg
-------------
{NULL,NULL}
(1 row)So what is the number of distinct "x" values in this case? I think that
according to SQL model - 0 (as count(distinct) returns).
1, but getting this answer computed is a non-trivial expression as the
count aggregate can’t do the counting.
Why in this case array_agg(distinct x) returns non-empty array?
I can be convinced to see an inconsistency here. In count though, not
array_agg. The inability for count to see and thus count null values
cannot be worked around while you can always apply a filter clause to
ignore null values you don’t want.
Yes, unlike most other aggregates, `array_agg` is not ignoring null values.
But is it correct to treat two null values as the same (non-distinct)?
Yes, in most contexts where null values are forced to be compared to each
other they do so by defining all null values to be representationally
identical. See group by for the most authoritative reference case.
IMHO correct result in this case should be either {} or NULL, either
{NULL,NULL}.
You have a typo here somewhere…
If you want the empty array use a filter clause to make it behave
“strictly”. Producing a null output seems indefensible.
A policy that all nulls values are indistinct from one another, which is
the most prevalent one in SQL, makes the most sense to me. My gut says
that “group by col nulls [not] distinct” would be an undesirable thing to
add to the language. It was probably added to unique indexes because they
went the wrong way and needed a way to course-correct.
David J.
Konstantin Knizhnik <knizhnik@garret.ru> writes:
On 04/12/2024 9:03 am, Tom Lane wrote:
Konstantin Knizhnik <knizhnik@garret.ru> writes:
postgres=# create table t(x integer unique);
CREATE TABLE
postgres=# insert into t values (null),(null);
INSERT 0 2
postgres=# select count(distinct x) from t;
count
-------
0
(1 row)
postgres=# select array_agg(distinct x) from t;
array_agg
-----------
{NULL}
(1 row)
postgres=# select array_agg(x) from t;
array_agg
-------------
{NULL,NULL}
(1 row)I see nothing contradictory here. "array_agg(distinct x)"
combines the two NULLs into one, which is the normal
behavior of DISTINCT.Sorry.
It is actually inconsistency in basic SQL model in interpretation of
NULL by UNIQUE and DISTINCT, and array_agg just follows this rule.
The behaviour of DISTINCT in aggregates matches the behaviour of the IS
(NOT) DISTINCT FROM predicate, which considers NULLs NOT DISTINCT from
eachother.
UNIQUE constraints leave it implementation-defined whether NULLs are
considered distinct (PostgreSQL defaults to NULLS DISTINCT), but that
can be overridden in the constraint definition.
- ilmari