Are multiple array_aggs going to be in the same order?

Started by Guyren Howeabout 9 years ago11 messagesgeneral
Jump to latest
#1Guyren Howe
guyren@gmail.com

If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the same (ie corresponding) order?

eg

SELECT
u.name,
ARRAY_AGG(o.order_date) AS order_dates,
ARRAY_AGG(o.order_total) AS order_totals
FROM
user u JOIN
orders o USING (user_id)
GROUP BY
u.user_id

#2Merlin Moncure
mmoncure@gmail.com
In reply to: Guyren Howe (#1)
Re: Are multiple array_aggs going to be in the same order?

On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:

If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the same (ie corresponding) order?

eg

SELECT
u.name,
ARRAY_AGG(o.order_date) AS order_dates,
ARRAY_AGG(o.order_total) AS order_totals
FROM
user u JOIN
orders o USING (user_id)
GROUP BY
u.user_id

It is unsafe to rely on aggregation order unless specified -- you can
add ORDER BY to the aggregation clause.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#2)
Re: Are multiple array_aggs going to be in the same order?

Merlin Moncure <mmoncure@gmail.com> writes:

On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:

If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the same (ie corresponding) order?

eg

SELECT
u.name,
ARRAY_AGG(o.order_date) AS order_dates,
ARRAY_AGG(o.order_total) AS order_totals
FROM
user u JOIN
orders o USING (user_id)
GROUP BY
u.user_id

It is unsafe to rely on aggregation order unless specified -- you can
add ORDER BY to the aggregation clause.

You definitely can't assume anything about the order in which the FROM
clause will deliver rows, but I think that's not quite what the question
was. If I read it right, the OP wants to be sure that the two aggregate
functions will see the data in the *same* unspecified order. I think
that's a pretty safe assumption. The server would have to go way
out of its way to do differently, and it doesn't.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Paul Jungwirth
pj@illuminatedcomputing.com
In reply to: Merlin Moncure (#2)
Re: Are multiple array_aggs going to be in the same order?

On 04/10/2017 09:33 AM, Merlin Moncure wrote:

On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:

If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs,
will the ARRAY_AGGs be guaranteed to have entries in the
same (ie corresponding) order?

It is unsafe to rely on aggregation order unless specified --
you can add ORDER BY to the aggregation clause.

Just thought I'd add that if you do this:

SELECT u.name,
ARRAY_AGG(o.order_date ORDER BY o.id) AS order_dates,
ARRAY_AGG(o.order_total ORDER BY o.id) AS order_totals
...

Then you can check EXPLAIN and should see that Postgres is only doing
one sort, not two, so there is no performance cost. (Of course for more
complicated queries you might want to double check what EXPLAIN intends
to do.)

But something I've wondered myself is how well Postgres handles sorting
already-mostly-sorted lists? I tried diving into the code, starting from
here:

https://doxygen.postgresql.org/tuplesort_8c_source.html

but I couldn't answer the question myself. The comments say that as long
as the tuples all fit in work_mem, it uses qsort, but the code appears
to call qsort_ssup or qsort_tuple, whose definitions I couldn't find. (I
looks like they are from qsort_tuple.c which is generated by a Perl
script.) I know that qsort(3) is not necessarily quicksort, despite the
name. Does anyone know what algorithm Postgres uses? It seems like the
mostly-already-sorted case would happen a lot, so I'm curious if
Postgres pays a performance cost there?

Thanks,
Paul

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#3)
Re: Are multiple array_aggs going to be in the same order?

On Mon, Apr 10, 2017 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:

If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the same (ie corresponding) order?

eg

SELECT
u.name,
ARRAY_AGG(o.order_date) AS order_dates,
ARRAY_AGG(o.order_total) AS order_totals
FROM
user u JOIN
orders o USING (user_id)
GROUP BY
u.user_id

It is unsafe to rely on aggregation order unless specified -- you can
add ORDER BY to the aggregation clause.

You definitely can't assume anything about the order in which the FROM
clause will deliver rows, but I think that's not quite what the question
was. If I read it right, the OP wants to be sure that the two aggregate
functions will see the data in the *same* unspecified order. I think
that's a pretty safe assumption. The server would have to go way
out of its way to do differently, and it doesn't.

Sure, but isn't it fair to consider that an implementation artifact?
If his code depends on that ordering being the same across aggregate
functions, and the SQL standard doesn't specify that (I guess it
might, but I'm skeptical), he ought to specify that for clarify at the
very least.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6David G. Johnston
david.g.johnston@gmail.com
In reply to: Merlin Moncure (#5)
Re: Are multiple array_aggs going to be in the same order?

On Mon, Apr 10, 2017 at 11:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Mon, Apr 10, 2017 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:

If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the

ARRAY_AGGs be guaranteed to have entries in the same (ie corresponding)
order?

eg

SELECT
u.name,
ARRAY_AGG(o.order_date) AS order_dates,
ARRAY_AGG(o.order_total) AS order_totals
FROM
user u JOIN
orders o USING (user_id)
GROUP BY
u.user_id

It is unsafe to rely on aggregation order unless specified -- you can
add ORDER BY to the aggregation clause.

You definitely can't assume anything about the order in which the FROM
clause will deliver rows, but I think that's not quite what the question
was. If I read it right, the OP wants to be sure that the two aggregate
functions will see the data in the *same* unspecified order. I think
that's a pretty safe assumption. The server would have to go way
out of its way to do differently, and it doesn't.

Sure, but isn't it fair to consider that an implementation artifact?
If his code depends on that ordering being the same across aggregate
functions, and the SQL standard doesn't specify that (I guess it
might, but I'm skeptical), he ought to specify that for clarify at the
very least.

So, the presence of ORDER BY in the aggregate function call is a PostgreSQL
extension...

It seems reasonable to declare that the order of the values in the
generated array match whatever order the FROM clause supplies the rows. If
that is not acceptable a PostgreSQL-specific ORDER BY modifier can be added
which will cause an additional sort-and-scan of the input relation to occur
(optimized across multiple column invocations when possible). Thus two
aggregate functions w/o an ORDER BY will always see the source rows in the
same order.

SELECT array_agg(v)
FROM (
SELECT * FROM (VALUES (3),(1),(2)) vals (v) ORDER BY 1
) AS src

The only real confusion is whether a query like the above is guaranteed to
supply rows to the outer select target list in order. If that is indeed
the case then the overall behavior seems quite reasonable to explicitly
define like above.

Dave

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#6)
Re: Are multiple array_aggs going to be in the same order?

"David G. Johnston" <david.g.johnston@gmail.com> writes:

On Mon, Apr 10, 2017 at 11:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

Sure, but isn't it fair to consider that an implementation artifact?

So, the presence of ORDER BY in the aggregate function call is a PostgreSQL
extension...

It seems reasonable to declare that the order of the values in the
generated array match whatever order the FROM clause supplies the rows. If
that is not acceptable a PostgreSQL-specific ORDER BY modifier can be added
which will cause an additional sort-and-scan of the input relation to occur
(optimized across multiple column invocations when possible).

Yes, and in fact we documented the ORDER-BY-in-subselect solution back
before we had the ORDER-BY-in-aggregate feature. I don't remember exactly
where, but I'm sure it's still described somewhere. So it is documented
behavior that an aggregate without its own ORDER BY will see the rows in
whatever order the FROM clause supplies them.

I'm not very keen on recommending that the OP insert an ORDER BY into
each aggregate call, because that would cause a separate sort for each
aggregate (unless someone's improved that recently while I wasn't
looking).

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#7)
Re: Are multiple array_aggs going to be in the same order?

On Mon, Apr 10, 2017 at 1:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"David G. Johnston" <david.g.johnston@gmail.com> writes:

On Mon, Apr 10, 2017 at 11:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

Sure, but isn't it fair to consider that an implementation artifact?

So, the presence of ORDER BY in the aggregate function call is a PostgreSQL
extension...

It seems reasonable to declare that the order of the values in the
generated array match whatever order the FROM clause supplies the rows. If
that is not acceptable a PostgreSQL-specific ORDER BY modifier can be added
which will cause an additional sort-and-scan of the input relation to occur
(optimized across multiple column invocations when possible).

Yes, and in fact we documented the ORDER-BY-in-subselect solution back
before we had the ORDER-BY-in-aggregate feature. I don't remember exactly
where, but I'm sure it's still described somewhere. So it is documented
behavior that an aggregate without its own ORDER BY will see the rows in
whatever order the FROM clause supplies them.

The documentation is a bit ambiguous on the topic TBH. Via
https://www.postgresql.org/docs/9.6/static/functions-aggregate.html:

"The aggregate functions array_agg, json_agg, jsonb_agg,
json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
similar user-defined aggregate functions, produce meaningfully
different result values depending on the order of the input values.
This ordering is unspecified by default, but can be controlled by
writing an ORDER BY clause within the aggregate call, as shown in
Section 4.2.7. Alternatively, supplying the input values from a sorted
subquery will usually work. For example:

SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

Beware that this approach can fail if the outer query level contains
additional processing, such as a join, because that might cause the
subquery's output to be reordered before the aggregate is computed."

If you think the behavior ought to be defined to work that way, that's
fine by me. A small documentation fix could clarify that, I think.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Paul Jungwirth
pj@illuminatedcomputing.com
In reply to: Tom Lane (#7)
Re: Are multiple array_aggs going to be in the same order?

On 04/10/2017 11:35 AM, Tom Lane wrote:

I'm not very keen on recommending that the OP insert an ORDER BY into
each aggregate call, because that would cause a separate sort for each
aggregate (unless someone's improved that recently while I wasn't
looking).

I mentioned this in my other email, upon a second look I was misreading
the EXPLAIN output. (The sort was for the GROUP BY, not the individual
ORDER BYs.) Do aggregate function sorts even show up in EXPLAIN? I can't
seem to find any. For example in this try the sorts are just for
grouping and joining:

db=> explain select c.id, array_agg(e.item order by e.id),
array_agg(e.payee order by e.id) from expense_categories c join expenses
e on e.expense_category_id = c.id group by c.id;
QUERY PLAN

---------------------------------------------------------------------------------------
GroupAggregate (cost=223.44..285.14 rows=16 width=30)
-> Merge Join (cost=223.44..264.44 rows=2728 width=30)
Merge Cond: (c.id = e.expense_category_id)
-> Sort (cost=1.48..1.52 rows=16 width=4)
Sort Key: c.id
-> Seq Scan on expense_categories c (cost=0.00..1.16
rows=16 width=4)
-> Sort (cost=221.96..228.78 rows=2728 width=30)
Sort Key: e.expense_category_id
-> Seq Scan on expenses e (cost=0.00..66.28 rows=2728
width=30)
(9 rows)

Paul

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#8)
Re: Are multiple array_aggs going to be in the same order?

On Mon, Apr 10, 2017 at 1:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Mon, Apr 10, 2017 at 1:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yes, and in fact we documented the ORDER-BY-in-subselect solution back
before we had the ORDER-BY-in-aggregate feature. I don't remember exactly
where, but I'm sure it's still described somewhere. So it is documented
behavior that an aggregate without its own ORDER BY will see the rows in
whatever order the FROM clause supplies them.

The documentation is a bit ambiguous on the topic TBH. Via
https://www.postgresql.org/docs/9.6/static/functions-aggregate.html:

"The aggregate functions array_agg, json_agg, jsonb_agg,
json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
similar user-defined aggregate functions, produce meaningfully
different result values depending on the order of the input values.
This ordering is unspecified by default, but can be controlled by
writing an ORDER BY clause within the aggregate call, as shown in
Section 4.2.7. Alternatively, supplying the input values from a sorted
subquery will usually work.

<snip>

What do you think about injecting the following lingo?

... depending on the order of the input values. This ordering is
unspecified by default, but can be controlled by ...

would be changed to

... depending on the order of the input values. Input value ordering
will be consistent across multiple order dependent aggregate functions
across the same grouping but is otherwise unspecified by default as
long as all there is no explicit ordering for any aggregate function
in the grouping. Furthermore, input value ordering can be controlled
by ...

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Paul Jungwirth (#9)
Re: Are multiple array_aggs going to be in the same order?

Paul Jungwirth <pj@illuminatedcomputing.com> writes:

On 04/10/2017 11:35 AM, Tom Lane wrote:

I'm not very keen on recommending that the OP insert an ORDER BY into
each aggregate call, because that would cause a separate sort for each
aggregate (unless someone's improved that recently while I wasn't
looking).

I mentioned this in my other email, upon a second look I was misreading
the EXPLAIN output. (The sort was for the GROUP BY, not the individual
ORDER BYs.) Do aggregate function sorts even show up in EXPLAIN?

They don't. It's not a plannable behavior, just something that's
hard-wired into nodeAgg.c. I'd like to see that improved someday,
but I don't think it's really high on anyone's priority list.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general