array_agg() does not stop aggregating according to HAVING clause
Hello list,
I have a query that goes through *billions* of rows, and for the
"datatag" values that occur infrequently (HAVING count(datatag_n) < 10)
it collects the IDs of all matching entries (array_agg(run_n)). Here is
the full query:
INSERT INTO infrequent_datatags_in_this_chunk
SELECT datatag, datatags.datatag_n, array_agg(run_n)
FROM runs_raw
JOIN datatags USING (datatag_n)
WHERE workitem_n >= 295
  AND workitem_n < 714218
  AND datatag IS NOT NULL
GROUP BY datatags.datatag_n
HAVING count(datatag_n) < 10
   AND count(datatag_n) > 0  -- not really needed because of the JOIN above
;
The runs_raw table has run_n as the primary key id, and an index on
workitem_n. The datatags table is a key value store with datatag_n as
primary key.
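For context, the relevant schema is roughly as follows (a sketch based
on the description above; the column types are my assumptions, not the
actual DDL):

    CREATE TABLE datatags (
        datatag_n  bigint PRIMARY KEY,
        datatag    text
    );

    CREATE TABLE runs_raw (
        run_n      bigint PRIMARY KEY,
        workitem_n bigint,
        datatag_n  bigint REFERENCES datatags (datatag_n)
    );

    CREATE INDEX ON runs_raw (workitem_n);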
The problem is that this query is extremely slow (5 hours), most likely
because it creates tens of gigabytes of temporary files, as I see in the
logs. I suspect it is writing to disk the array_agg(run_n) for all
groups, not only those satisfying HAVING count(datatag_n) < 10. (I might
be wrong, though; this is only an assumption based on the amount of data
written, and I don't know of any way to inspect the temporary files.)
While this query goes through billions of rows, the ones with infrequent
datatags number maybe 10M.
How do I tell postgres to stop aggregating when count>=10?
Thank you in advance,
Dimitris
Dimitrios Apostolou <jimis@gmx.net> writes:
> I have a query that goes through *billions* of rows, and for the
> "datatag" values that occur infrequently (HAVING count(datatag_n) < 10)
> it collects the IDs of all matching entries (array_agg(run_n)). Here is
> the full query:
>
> INSERT INTO infrequent_datatags_in_this_chunk
> SELECT datatag, datatags.datatag_n, array_agg(run_n)
> FROM runs_raw
> JOIN datatags USING (datatag_n)
> WHERE workitem_n >= 295
>   AND workitem_n < 714218
>   AND datatag IS NOT NULL
> GROUP BY datatags.datatag_n
> HAVING count(datatag_n) < 10
>    AND count(datatag_n) > 0  -- not really needed because of the JOIN above
> ;
>
> The problem is that this is extremely slow (5 hours), most likely because
> it creates tens of gigabytes of temporary files as I see in the logs. I
> suspect that it is writing to disk the array_agg(run_n) of all entries and
> not only those HAVING count(datatag_n) < 10.
Well, yes: the two aggregates (array_agg and count) are computed
concurrently in a single Aggregate plan node scanning the output
of the JOIN. There's no way to apply the HAVING filter until
after the aggregation is finished.
I think this approach is basically forced by the SQL standard's
semantics for grouping/aggregation.
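Conceptually, HAVING behaves as if every aggregate were computed to
completion first and the filter then applied to the finished aggregate
rows; something like this equivalent form of your query's core (a
sketch, leaving out the join and WHERE conditions):

    SELECT datatag_n, cnt, ids
    FROM (SELECT datatag_n,
                 count(datatag_n) AS cnt,
                 array_agg(run_n) AS ids
          FROM runs_raw
          GROUP BY datatag_n) AS s
    WHERE cnt < 10;

so the array_agg() result is fully built for every group before any
group can be thrown away.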
> How do I tell postgres to stop aggregating when count >= 10?
The only way to do this would be to do two separate passes of
aggregation in separate sub-queries. Perhaps like
WITH rare AS (
  SELECT datatag_n
  FROM runs_raw
  WHERE workitem_n >= 295
    AND workitem_n < 714218
  GROUP BY datatag_n
  HAVING count(datatag_n) < 10
     AND count(datatag_n) > 0
)
INSERT INTO infrequent_datatags_in_this_chunk
SELECT datatag, datatags.datatag_n, array_agg(run_n)
FROM runs_raw
JOIN datatags USING (datatag_n)
JOIN rare USING (datatag_n)
WHERE workitem_n >= 295
  AND workitem_n < 714218
  AND datatag IS NOT NULL
GROUP BY datatags.datatag_n
;
I can't tell from what you said which level the workitem_n and
datatag conditions go at, so this is just a draft-quality
query. But I think the structure is basically okay, given
that you said datatag_n is unique in datatags (so there's no
need to join it in the WITH step).
regards, tom lane
On Sat, 17 Aug 2024, Tom Lane wrote:
> Well, yes: the two aggregates (array_agg and count) are computed
> concurrently in a single Aggregate plan node scanning the output
> of the JOIN. There's no way to apply the HAVING filter until
> after the aggregation is finished.
>
> I think this approach is basically forced by the SQL standard's
> semantics for grouping/aggregation.
FWIW I also tried:
HAVING array_length(array_agg(run_n), 1) < 10;
but I saw the same amount of temp files, at least in the short duration of
my test run.
Thank you, I will split this into two passes as you suggested. It's just
that I'm doing another three passes over this table for different
computations (different GROUP BY, different WHERE clauses), and I was
hoping to minimize the total time spent. But avoiding the array_agg()
over everything is my top priority at the moment, so I'll definitely try
it.
Regards,
Dimitris