NULL-handling in aggregate(DISTINCT ...)

Started by Andrew Gierthabout 16 years ago6 messages

andrew@tao11.riddles.org.uk

about 16 years ago

Quoth the comments in nodeAgg.c:

* We don't currently implement DISTINCT aggs for aggs having more
* than one argument. This isn't required for anything in the SQL
* spec, but really it ought to be implemented for
* feature-completeness. FIXME someday.

and:

* DISTINCT always suppresses nulls, per SQL spec, regardless of the
* transition function's strictness.

(What the SQL spec actually says is that aggregate calls which are
<general set operation> ignore all nulls regardless of whether they
are ALL or DISTINCT. Other kinds of aggregates are not permitted by
the spec to use ALL or DISTINCT.)

Currently we have this behaviour:

postgres=# select array_agg(all a) from (values (1),(null)) v(a);
array_agg
-----------
{1,NULL}
(1 row)

postgres=# select array_agg(distinct a) from (values (1),(null)) v(a);
array_agg
-----------
{1}
(1 row)

which personally I feel is somewhat wrong, since 1 and NULL are in
fact distinct, but which is due to the logic expressed in the second
comment above. (The spec does not allow array_agg(distinct a) so it
is no help here.)

Now the question: If the limit of one argument for DISTINCT aggs were
removed (which I'm considering doing as part of an update to the
aggregate ORDER BY patch I posted a while back), what should be the
behaviour of agg(distinct x,y) where one or both of x or y is null?
And should it depend on the strictness of the transition function?

--
Andrew (irc:RhodiumToad)

Tom Lane

tgl@sss.pgh.pa.us

about 16 years ago

In reply to: Andrew Gierth (#1)

Re: NULL-handling in aggregate(DISTINCT ...)

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Now the question: If the limit of one argument for DISTINCT aggs were
removed (which I'm considering doing as part of an update to the
aggregate ORDER BY patch I posted a while back), what should be the
behaviour of agg(distinct x,y) where one or both of x or y is null?
And should it depend on the strictness of the transition function?

I think you could probably just change it: make DISTINCT work as per
regular DISTINCT (treat null like a value, keep one copy). All the
spec-conforming aggregates are strict and would ignore the null in the
next step anyway.

regards, tom lane

Andrew Gierth

andrew@tao11.riddles.org.uk

about 16 years ago

In reply to: Tom Lane (#2)

Re: NULL-handling in aggregate(DISTINCT ...)

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Now the question: If the limit of one argument for DISTINCT aggs were
removed (which I'm considering doing as part of an update to the
aggregate ORDER BY patch I posted a while back), what should be the
behaviour of agg(distinct x,y) where one or both of x or y is null?
And should it depend on the strictness of the transition function?

Tom> I think you could probably just change it: make DISTINCT work as
Tom> per regular DISTINCT (treat null like a value, keep one copy).
Tom> All the spec-conforming aggregates are strict and would ignore
Tom> the null in the next step anyway.

Change it for single-arg DISTINCT too? And the resulting change to the
established behaviour of array_agg is acceptable? Just want to be clear
here.

--
Andrew.

Tom Lane

tgl@sss.pgh.pa.us

about 16 years ago

In reply to: Andrew Gierth (#3)

Re: NULL-handling in aggregate(DISTINCT ...)

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> I think you could probably just change it: make DISTINCT work as
Tom> per regular DISTINCT (treat null like a value, keep one copy).
Tom> All the spec-conforming aggregates are strict and would ignore
Tom> the null in the next step anyway.

Change it for single-arg DISTINCT too? And the resulting change to the
established behaviour of array_agg is acceptable? Just want to be clear
here.

I doubt that very many people are depending on the behavior of
array_agg(DISTINCT); and anyway it could be argued that the present
behavior is a bug, since it doesn't work like standard DISTINCT.
I don't see a problem with changing it, though it should be
release-noted.

regards, tom lane

Andrew Gierth

andrew@tao11.riddles.org.uk

about 16 years ago

In reply to: Tom Lane (#4)

Re: NULL-handling in aggregate(DISTINCT ...)

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Change it for single-arg DISTINCT too? And the resulting change to the
established behaviour of array_agg is acceptable? Just want to be clear
here.

Tom> I doubt that very many people are depending on the behavior of
Tom> array_agg(DISTINCT); and anyway it could be argued that the
Tom> present behavior is a bug, since it doesn't work like standard
Tom> DISTINCT. I don't see a problem with changing it, though it
Tom> should be release-noted.

A followup question: currently the code uses the "datum" interface for
tuplesort. Obviously with multiple columns the slot interface is used
instead; but is there any performance advantage for staying with the
datum interface for the single-column case?

--
Andrew.

Tom Lane

tgl@sss.pgh.pa.us

about 16 years ago

In reply to: Andrew Gierth (#5)

Re: NULL-handling in aggregate(DISTINCT ...)

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

A followup question: currently the code uses the "datum" interface for
tuplesort. Obviously with multiple columns the slot interface is used
instead; but is there any performance advantage for staying with the
datum interface for the single-column case?

No idea ... measure it and see.

regards, tom lane