Improve hash-agg performance

Started by Andres Freundover 9 years ago4 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hi,

There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.

1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the argument
for all aggregates in one go. Both because it reduces the amount of
function call / moves more things into a relatively tight loop, and
because it allows to deform all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject alltogether (i.e. directly evaluate the expression as we
used to), but as soon you have two the new approach is faster.

2) For hash-aggs we right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is to simply use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.

Comments?

Regards,

Andres Freund

Attachments:

0001-Perform-one-only-projection-to-compute-agg-arguments.patchtext/x-patch; charset=us-asciiDownload+105-36
0002-User-narrower-representative-tuples-in-the-hash-agg-.patchtext/x-patch; charset=us-asciiDownload+111-57
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#1)
Re: Improve hash-agg performance

On 11/03/2016 01:07 PM, Andres Freund wrote:

Hi,

There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.

1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the argument
for all aggregates in one go. Both because it reduces the amount of
function call / moves more things into a relatively tight loop, and
because it allows to deform all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject alltogether (i.e. directly evaluate the expression as we
used to), but as soon you have two the new approach is faster.

Makes sense. If we do your refactoring of ExecEvalExpr into an
intermediate opcode representation, I assume the performance difference
will go away anyway.

2) For hash-aggs we right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is to simply use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.

Heh, I came to the same conclusion a couple of months ago when I was
profiling the aggregate code. I never got around to finish up and post
the patch I wrote back then, but here you go, for comparison. It's
pretty much the same as what you got here. So yeah, seems like a good idea.

- Heikki

Attachments:

0001-Don-t-store-unused-columns-in-hash-table.patchtext/x-diff; name=0001-Don-t-store-unused-columns-in-hash-table.patchDownload+108-28
#3Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: Improve hash-agg performance

On 2016-11-04 15:18:49 +0200, Heikki Linnakangas wrote:

On 11/03/2016 01:07 PM, Andres Freund wrote:

Hi,

There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.

1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the argument
for all aggregates in one go. Both because it reduces the amount of
function call / moves more things into a relatively tight loop, and
because it allows to deform all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject alltogether (i.e. directly evaluate the expression as we
used to), but as soon you have two the new approach is faster.

Makes sense. If we do your refactoring of ExecEvalExpr into an intermediate
opcode representation, I assume the performance difference will go away
anyway.

Precisely.

2) For hash-aggs we right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is to simply use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.

Heh, I came to the same conclusion a couple of months ago when I was
profiling the aggregate code. I never got around to finish up and post the
patch I wrote back then, but here you go, for comparison. It's pretty much
the same as what you got here. So yeah, seems like a good idea.

+		/*
+		 * Note that we don't deduplicate key columns. That would complicate
+		 * the comparisons. Don't write silly queries! (Not sure if that would get
+		 * through the parser and optimizer, though).

Hehe. You made me spill more coffee.

Thanks for looking!

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Improve hash-agg performance

Hi,

On 2016-11-03 04:07:21 -0700, Andres Freund wrote:

Hi,

There's two things I found while working on faster expression
evaluation, slot deforming and batched execution. As those two issues
often seemed quite dominant cost-wise it seemed worthwhile to evaluate
them independently.

1) We atm do one ExecProject() to compute each aggregate's
arguments. Turns out it's noticeably faster to compute the argument
for all aggregates in one go. Both because it reduces the amount of
function call / moves more things into a relatively tight loop, and
because it allows to deform all the required columns at once, rather
than one-by-one. For a single aggregate it'd be faster to avoid
ExecProject alltogether (i.e. directly evaluate the expression as we
used to), but as soon you have two the new approach is faster.

2) For hash-aggs we right now we store the representative tuple using
the input tuple's format, with unneeded columns set to NULL. That
turns out to be expensive if the aggregated-on columns are not
leading columns, because we have to skip over a potentially large
number of NULLs. The fix here is to simply use a different tuple
format for the hashtable. That doesn't cause overhead, because we
already move columns in/out of the hashslot explicitly anyway.

I pushed a bit more polished versions of these.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers