hashagg slowdown due to spill changes

Started by Andres Freund · 31 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Sample benchmark:

config:
-c huge_pages=on -c shared_buffers=32GB -c jit=0 -c max_parallel_workers_per_gather=0
(largely just to reduce variance)

data prep:
CREATE TABLE fewgroups_many_rows AS SELECT (random() * 4)::int cat, (random()*10000)::int val FROM generate_series(1, 100000000);
VACUUM (FREEZE, ANALYZE) fewgroups_many_rows;

test prep:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('fewgroups_many_rows', 'buffer');

test:
SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;

Using best-of-three timing:

v12     12221.031 ms
master  13855.129 ms

While not the end of the world, that's a definitely noticeable and
reproducible slowdown (~12%).

I don't think this is actually an inherent cost, but a question of how
the code ended up being organized. Here's a perf diff of profiles for
both versions:

# Baseline  Delta Abs  Shared Object     Symbol
# ........  .........  ................  .........................................
#
               +6.70%  postgres          [.] LookupTupleHashEntryHash
               +6.37%  postgres          [.] prepare_hash_slot
               +4.74%  postgres          [.] TupleHashTableHash_internal.isra.0
    20.36%     -2.89%  postgres          [.] ExecInterpExpr
     6.31%     -2.73%  postgres          [.] lookup_hash_entries
               +2.36%  postgres          [.] lookup_hash_entry
               +2.14%  postgres          [.] ExecJustAssignScanVar
     2.28%     +1.97%  postgres          [.] ExecScan
     2.54%     +1.93%  postgres          [.] MemoryContextReset
     3.84%     -1.86%  postgres          [.] SeqNext
    10.19%     -1.50%  postgres          [.] tts_buffer_heap_getsomeattrs
               +1.42%  postgres          [.] hash_bytes_uint32
               +1.39%  postgres          [.] TupleHashTableHash
               +1.10%  postgres          [.] tts_virtual_clear
     3.36%     -0.74%  postgres          [.] ExecAgg
               +0.45%  postgres          [.] CheckForSerializableConflictOutNeeded
     0.25%     +0.44%  postgres          [.] hashint4
     5.80%     -0.35%  postgres          [.] tts_minimal_getsomeattrs
     1.91%     -0.33%  postgres          [.] heap_getnextslot
     4.86%     -0.32%  postgres          [.] heapgettup_pagemode
     1.46%     -0.32%  postgres          [.] tts_minimal_clear

While some of this is likely just noise, it's pretty clear that we
spend a substantial amount of additional time below
lookup_hash_entries().

And looking at the code, I'm not too surprised:

Before there was basically one call from nodeAgg.c to execGrouping.c for
each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

For each of these, data needs to be peeled out of one or more of AggState
/ AggStatePerHashData / TupleHashTable. There's no way the compiler can
know that nothing inside those changes, therefore it has to reload the
contents repeatedly. From my look at the profiles, that's where most of
the time is going.
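The effect is easy to reproduce outside PostgreSQL. In this standalone sketch (all names hypothetical, not the actual nodeAgg.c structures), the first variant must re-load the struct fields across each opaque call, while the second hoists them into locals that can stay in registers:

```c
#include <stdint.h>

/* Toy stand-ins for fields that nodeAgg.c pulls out of per-hash state. */
struct hash_state
{
    uint32_t ncols;
    const uint32_t *values;
};

/* noinline makes the call opaque at the call site, like a call into
 * another translation unit: the compiler must assume it may have
 * changed memory reachable through hs. */
__attribute__((noinline))
static uint32_t opaque_hash_step(uint32_t acc, uint32_t v)
{
    return acc * 31 + v;
}

/* Re-reads hs->ncols and hs->values on every iteration, because the
 * opaque call could have modified them. */
static uint32_t hash_reloading(const struct hash_state *hs)
{
    uint32_t acc = 0;

    for (uint32_t i = 0; i < hs->ncols; i++)
        acc = opaque_hash_step(acc, hs->values[i]);
    return acc;
}

/* Hoists the fields into locals once, so they can live in registers
 * across the opaque calls. */
static uint32_t hash_hoisted(const struct hash_state *hs)
{
    const uint32_t ncols = hs->ncols;
    const uint32_t *const values = hs->values;
    uint32_t acc = 0;

    for (uint32_t i = 0; i < ncols; i++)
        acc = opaque_hash_step(acc, values[i]);
    return acc;
}
```

Both compute the same result; the difference shows up only in the generated loads, which is the kind of cost the profile above attributes to the extra nodeAgg.c/execGrouping.c round trips.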

There's also the issue that the signalling whether to insert / not to
insert got unnecessarily complicated. There's several checks:
1) lookup_hash_entry() (p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;)
2) LookupTupleHashEntry_internal() (if (isnew))
3) lookup_hash_entry() (if (entry == NULL) and if (isnew))
4) lookup_hash_entries() if (!in_hash_table)

Not performance related: I am a bit confused why the new per-hash stuff
in lookup_hash_entries() isn't in lookup_hash_entry()? I assume that's
because of agg_refill_hash_table()?

Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

That way there's again exactly one call to execGrouping.c, there's no
need for nodeAgg to separately compute the hash, there's far fewer
branches...

Doing things this way might perhaps make agg_refill_hash_table() a tiny
bit more complicated, but it'll also avoid the slowdown for the vast
majority of cases where we're not spilling.
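As a rough illustration of that flow, here is a standalone toy sketch (hypothetical names, a trivial fixed-size table with no collision handling — not the actual nodeAgg.c/execGrouping.c code): a single lookup-or-insert entry point returns the hash through an out parameter, and the spill check is the only extra branch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 64

struct entry { bool used; uint32_t key; long count; };
struct table
{
    struct entry buckets[NBUCKETS];
    bool spill_mode;
    long spilled;
};

static uint32_t calc_hash(uint32_t key) { return key * 2654435761u; }

/* One call that hashes and looks up (inserting only when isnew is
 * non-NULL), returning the hash via an out parameter so the caller
 * could store it alongside a spilled tuple. */
static struct entry *lookup_or_insert(struct table *t, uint32_t key,
                                      bool *isnew, uint32_t *hashp)
{
    uint32_t hash = calc_hash(key);
    struct entry *e = &t->buckets[hash % NBUCKETS]; /* toy: no collisions */

    *hashp = hash;
    if (e->used && e->key == key)
    {
        if (isnew)
            *isnew = false;
        return e;
    }
    if (isnew == NULL)
        return NULL;            /* lookup-only mode: group not found */
    e->used = true;
    e->key = key;
    e->count = 0;
    *isnew = true;
    return e;
}

static void advance_one(struct table *t, uint32_t key)
{
    uint32_t hash;
    bool isnew;
    struct entry *e;

    if (t->spill_mode)
    {
        /* 3) lookup only; tuples for absent groups get spilled */
        e = lookup_or_insert(t, key, NULL, &hash);
        if (e == NULL)
        {
            t->spilled++;       /* stands in for hashagg_spill_tuple() */
            return;
        }
    }
    else
    {
        /* 4) insert, initializing newly created groups */
        e = lookup_or_insert(t, key, &isnew, &hash);
        if (isnew)
            e->count = 0;       /* stands in for initialize(entry) */
    }
    e->count++;                 /* stands in for advancing transition state */
}
```

The point of the shape is exactly the one made above: one execGrouping.c-style call per tuple, the hash computed once and handed back, and a single spill-mode branch rather than checks scattered across four functions.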

Greetings,

Andres Freund

#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#1)
Re: hashagg slowdown due to spill changes

On 2020-Jun-05, Andres Freund wrote:

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Jeff, what are your thoughts on this?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#1)
Re: hashagg slowdown due to spill changes

On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:

Before there was basically one call from nodeAgg.c to execGrouping.c for
each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

The reason that I did it that way was to be able to store the hash
along with the saved tuple (similar to what HashJoin does), which
avoids recalculation.

That could be a nice savings for some cases, like when work_mem is
small but the data still fits in system memory, which I expect to be
fairly common. But based on your numbers, it might be a bad trade-off
overall.

Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

I'll work up a patch to refactor this. I'd still like to see if we can
preserve the calculate-hash-once behavior somehow.

Regards,
Jeff Davis

#4 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#1)
Re: hashagg slowdown due to spill changes

On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:

Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

I see, you are suggesting that I change around the execGrouping.c
signatures to return the hash, which will avoid the extra call. That
makes more sense.

Regards,
Jeff Davis

#5 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#3)
Re: hashagg slowdown due to spill changes

Hi,

On 2020-06-08 13:41:29 -0700, Jeff Davis wrote:

On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:

Before there was basically one call from nodeAgg.c to execGrouping.c for
each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

The reason that I did it that way was to be able to store the hash
along with the saved tuple (similar to what HashJoin does), which
avoids recalculation.

That makes sense. But then you can just use a separate call into
execGrouping for that purpose.

Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

I'll work up a patch to refactor this. I'd still like to see if we can
preserve the calculate-hash-once behavior somehow.

Cool!

Greetings,

Andres Freund

#6 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#1)
Re: hashagg slowdown due to spill changes

On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:

Hi,

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Attached a proposed fix. (Might require some minor cleanup).

The only awkward part is that LookupTupleHashEntry() needs a new out
parameter to pass the hash value back to the caller. Ordinarily, the
caller can get that from the returned entry, but if isnew==NULL, then
the function might return NULL (and the caller wouldn't have an entry
from which to read the hash value).
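A minimal mock of that contract (illustrative names only, not the real execGrouping.c signature): with isnew == NULL the function can return NULL, so the hash can only travel back to the caller through the out parameter:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entry { uint32_t key; };

/* Toy lookup: always reports the computed hash through *hashp, because
 * when isnew is NULL (lookup-only mode, as in spill mode) it may return
 * NULL and there would be no entry to read the hash from. */
static struct toy_entry *toy_lookup(uint32_t key, bool *isnew, uint32_t *hashp)
{
    *hashp = key * 2654435761u; /* caller always gets the hash... */
    if (isnew == NULL)
        return NULL;            /* ...even when no entry is returned */

    static struct toy_entry e;  /* toy: single-entry "table" */
    e.key = key;
    *isnew = true;
    return &e;
}
```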

Regards,
Jeff Davis

Attachments:

refactor-hashlookup.patch (text/x-patch; charset=UTF-8), +85 −90
#7 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#6)
Re: hashagg slowdown due to spill changes

Hi,

On June 10, 2020 6:15:39 PM PDT, Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2020-06-05 at 21:11 -0700, Andres Freund wrote:

Hi,

While preparing my pgcon talk I noticed that our hash-agg performance
degraded noticeably. Looks to me like it's due to the spilling-hashagg
changes.

Attached a proposed fix. (Might require some minor cleanup).

The only awkward part is that LookupTupleHashEntry() needs a new out
parameter to pass the hash value back to the caller. Ordinarily, the
caller can get that from the returned entry, but if isnew==NULL, then
the function might return NULL (and the caller wouldn't have an entry
from which to read the hash value).

Great!

Did you run any performance tests?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#8 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#7)
Re: hashagg slowdown due to spill changes

On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:

Did you run any performance tests?

Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.

Regards,
Jeff Davis

#9 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#8)
Re: hashagg slowdown due to spill changes

Hi,

On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:

On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:

Did you run any performance tests?

Yes, I reproduced your ~12% regression from V12, and this patch nearly
eliminated it for me.

I spent a fair bit of time looking at the difference. Jeff had let me
know on chat that he was still seeing some difference, but couldn't
quite figure out where that was.

Trying it out myself, I observed that the patch helped, but not that
much. After a bit I found one major reason for why:
LookupTupleHashEntryHash() assigned the hash to pointer provided by the
caller's before doing the insertion. That ended up causing a pipeline
stall (I assume it's store forwarding, but not sure). Moving the
assignment to the caller variable to after the insertion got rid of
that.
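The two orderings can be sketched as follows (hypothetical names; the performance difference itself comes from CPU store/load behavior and can't be asserted in code — only the functional equivalence can):

```c
#include <stdint.h>

static uint32_t table[256];

/* Variant A: write the caller's hash variable first, then immediately
 * load the hash back to do the insertion -- that dependent load may
 * have to wait on the just-issued store (a store-forwarding hazard on
 * some CPUs). */
static uint32_t *insert_store_first(uint32_t key, uint32_t *hashp)
{
    *hashp = key * 2654435761u;
    return &table[*hashp % 256];
}

/* Variant B: keep the hash in a local (register) for the insertion and
 * write the out parameter only after the dependent work is done. */
static uint32_t *insert_store_last(uint32_t key, uint32_t *hashp)
{
    uint32_t hash = key * 2654435761u;
    uint32_t *slot = &table[hash % 256];

    *hashp = hash;
    return slot;
}
```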

It got within 3-4% after that change. I did a number of small
microoptimizations that each helped, but that didn't quite get to the
level of v12.

Finally I figured out that that's due to an issue outside of nodeAgg.c
itself:

commit 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0
Author: Tomas Vondra <tomas.vondra@postgresql.org>
Date: 2020-05-31 14:43:13 +0200

Use CP_SMALL_TLIST for hash aggregate

Due to this change we end up with an additional projection in queries
like this:

postgres[212666][1]=# \d fewgroups_many_rows
Table "public.fewgroups_many_rows"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ cat │ integer │ │ not null │ │
│ val │ integer │ │ not null │ │
└────────┴─────────┴───────────┴──────────┴─────────┘

postgres[212666][1]=# explain SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;
┌───────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────┤
│ HashAggregate (cost=1942478.48..1942478.53 rows=5 width=12) │
│ Group Key: cat │
│ -> Seq Scan on fewgroups_many_rows (cost=0.00..1442478.32 rows=100000032 width=4) │
└───────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)

as 'val' is "projected away"..

After neutering the tlist change, Jeff's patch and my changes to it
yield performance *above* v12.

I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.

Greetings,

Andres Freund

Attachments:

aggspeed.diff (text/x-diff; charset=us-ascii), +112 −104
#10 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#9)
Re: hashagg slowdown due to spill changes

On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:

I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.

One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:

/messages/by-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com

Which makes sense, but it's also more code.

Regards,
Jeff Davis

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#10)
Re: hashagg slowdown due to spill changes

On Fri, Jun 12, 2020 at 03:29:08PM -0700, Jeff Davis wrote:

On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:

I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.

One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:

/messages/by-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com

Which makes sense, but it's also more code.

I agree, we should revert 4cad2534da and only project tuples when we
actually need to spill them. Did any of the WIP patches actually
implement that, or do we need to write that patch from scratch?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#11)
Re: hashagg slowdown due to spill changes

Hi,

On 2020-06-13 01:06:25 +0200, Tomas Vondra wrote:

I agree, we should revert 4cad2534da and only project tuples when we
actually need to spill them.

There are cases where projecting helps for non-spilling aggregates too,
but only for the representative tuple. It doesn't help in the case at
hand, because there's just 5 hashtable entries but millions of rows. So
we're unnecessarily projecting all-5 rows. But when there are many
different groups, it'd be different, because then the size of the
representative tuple can matter substantially.

Do you think we should tackle this for 13? To me 4cad2534da seems like a
somewhat independent improvement to spillable hashaggs.

Greetings,

Andres Freund

#13 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#12)
Re: hashagg slowdown due to spill changes

On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:

Do you think we should tackle this for 13? To me 4cad2534da seems like a
somewhat independent improvement to spillable hashaggs.

We've gone back and forth on this issue a few times, so let's try to
get some agreement before we revert 4cad2534da. I added Robert because
he also seemed to think it was a reasonable idea.

Regards,
Jeff Davis

#14 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#13)
Re: hashagg slowdown due to spill changes

On Sat, Jun 13, 2020 at 11:48:09AM -0700, Jeff Davis wrote:

On Fri, 2020-06-12 at 17:12 -0700, Andres Freund wrote:

Do you think we should tackle this for 13? To me 4cad2534da seems like a
somewhat independent improvement to spillable hashaggs.

We've gone back and forth on this issue a few times, so let's try to
get some agreement before we revert 4cad2534da. I added Robert because
he also seemed to think it was a reasonable idea.

I can't speak for Robert, but I hadn't expected the cost of the extra
projection to be this high. And I agree with Andres that it's not very
nice that we have to do this even for aggregates with just a handful of
groups that don't need to spill.

In any case, I think we need to address this somehow for v13 - either we
keep the 4cad2534da patch in, or we tweak the cost model to reflect the
extra I/O costs, or we project only when spilling.

I'm not in a position to whip up a patch soon, though :-(

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#10)
Re: hashagg slowdown due to spill changes

Hi,

On 2020-06-12 15:29:08 -0700, Jeff Davis wrote:

On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:

I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.

One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:

/messages/by-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com

Which makes sense, but it's also more code.

I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.

It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.
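A sketch of what such a heuristic might look like (the function name, threshold, and inputs are all made up for illustration; this is not proposed planner code):

```c
#include <stdbool.h>

/* Hypothetical threshold: narrow tuples only pay off when spilling is
 * plausible, so only request a small target list when the estimated
 * group count (or the estimated hash table size) is large. */
#define SMALL_TLIST_GROUP_THRESHOLD 10000.0

static bool want_small_tlist(double est_num_groups,
                             double hash_mem_limit,
                             double est_hashtable_size)
{
    return est_num_groups > SMALL_TLIST_GROUP_THRESHOLD ||
           est_hashtable_size > hash_mem_limit;
}
```

As Tomas notes below, picking the threshold is the hard part, and an under-estimated group count would still choose the wide target list.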

Greetings,

Andres Freund

#16 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#15)
Re: hashagg slowdown due to spill changes

On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:

I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.

I'm fine with that.

It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.

That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().

Regards,
Jeff Davis

#17 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#16)
Re: hashagg slowdown due to spill changes

On Sun, Jun 14, 2020 at 11:09:55PM -0700, Jeff Davis wrote:

On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:

I'm somewhat inclined to think that we should revert 4cad2534da6 and
then look at how precisely to tackle this in 14.

I'm fine with that.

I don't see how we could just revert 4cad2534d and leave this for v14.

The hashagg spilling is IMHO almost guaranteed to be a pain point for
some users, as it will force some queries to serialize large amounts of
data. Yes, some of this is a cost for hashagg enforcing work_mem at
runtime, I'm fine with that. We'd get reports about that too, but we can
justify that cost ...

But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1]. And no, this is not really
fixable by tweaking the cost parameters - even with the current code
(i.e. 4cad2534d in place) I had to increase random_page_cost to 60 on
the temp tablespace (on SATA RAID) to get good plans with parallelism
enabled. I haven't tried, but I presume without 4cad2534d I'd have to
push r_p_c even further ...

[1]: /messages/by-id/20200519151202.u2p2gpiawoaznsv2@development

It'd probably make sense to request small tlists when the number of
estimated groups is large, and not otherwise.

That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().

Maybe. It'd certainly be better than nothing. It's not clear to me what
a good threshold would be, though. And it's not going to handle cases of
under-estimates.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#17)
Re: hashagg slowdown due to spill changes

On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#18)
Re: hashagg slowdown due to spill changes

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about it.

Well, it's only June. Let's put it on the open issues list for v13
and continue to think about it. I concur that the hashagg spill patch
has made this something that we should worry about for v13, so just
reverting without a better answer isn't very appetizing.

regards, tom lane

#20 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#18)
Re: hashagg slowdown due to spill changes

On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:

On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about it.

I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple changes:

* I didn't see a reason to descend into a GroupingFunc node, so I
removed that.

* I used a flag in the context rather than two separate callbacks to
the expression walker.

This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could
use some review/cleanup though.

Regards,
Jeff Davis

Attachments:

hashagg-spill-project.patch (text/x-patch; charset=UTF-8), +96 −35
#21 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#20)
#22 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#20)
#23 Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#22)
#24 Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#23)
#25 Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#24)
#26 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Melanie Plageman (#25)
#27 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#9)
#28 Peter Geoghegan
In reply to: Andres Freund (#27)
#29 Peter Geoghegan
In reply to: Peter Geoghegan (#28)
#30 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#29)
In reply to: Jeff Davis (#30)