significant slowdown of HashAggregate between 9.6 and 10

Started by Pavel Stehule over 5 years ago, 9 messages
#1 Pavel Stehule
pavel.stehule@gmail.com

Hi

A Czech Postgres user reported a performance issue related to the speed of HashAggregate inside a nested loop.

The speed on 9.6:

HashAggregate (cost=27586.10..27728.66 rows=14256 width=24) (actual time=0.003..0.049 rows=39 loops=599203)

The speed on 10.7:

HashAggregate (cost=27336.78..27552.78 rows=21600 width=24) (actual time=0.011..0.156 rows=38 loops=597137)

So it looks like HashAgg is about 3x slower - combined with a brutal nested loop, that is a problem.

I wrote a simple benchmark, and it really does look like our hash aggregate keeps getting slower.

create table foo(a int, b int, c int, d int, e int, f int);
insert into foo select random()*1000, random()*4, random()*4, random()*2,
random()*100, random()*100 from generate_series(1,2000000);

analyze foo;

9.6.7
postgres=# explain (analyze, buffers) select i from
generate_series(1,500000) g(i) where exists (select count(*) cx from foo
group by b, c, d having count(*) = i);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=807.485..3364.515 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.006..0.006 rows=0 loops=500000)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.015..139.736 rows=2000000 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.276 ms
 Execution time: 3365.758 ms
(14 rows)

10.9

postgres=# explain (analyze, buffers) select i from
generate_series(1,500000) g(i) where exists (select count(*) cx from foo
group by b, c, d having count(*) = i);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=825.468..4919.063 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.009..0.009 rows=0 loops=500000)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.025..157.887 rows=2000000 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.829 ms
 Execution time: 4920.800 ms
(14 rows)

master

postgres=# explain (analyze, buffers) select i from
generate_series(1,500000) g(i) where exists (select count(*) cx from foo
group by b, c, d having count(*) = i);
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..28869973750.00 rows=250000 width=4) (actual time=901.639..6057.943 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=855 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.94 rows=1 width=20) (actual time=0.012..0.012 rows=0 loops=500000)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Peak Memory Usage: 37 kB
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.017..262.497 rows=2000000 loops=1)
                 Buffers: shared hit=12739
 Planning Time: 0.275 ms
   Buffers: shared hit=1
 Execution Time: 6059.266 ms
(16 rows)

Regards

Pavel

#2 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#1)
Re: significant slowdown of HashAggregate between 9.6 and 10

I tried to run the same query on half the data size, and the performance is almost the same. The performance issue is probably related to the initialization or finalization of the aggregation.

Pavel

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Pavel Stehule (#2)
Re: significant slowdown of HashAggregate between 9.6 and 10

Hi,

Not sure what the root cause is, but I can reproduce it. Timings for 9.6,
10 and master (all built from git with the same options), without explain
analyze, look like this:

9.6
-----------------
Time: 1971.314 ms
Time: 1995.875 ms
Time: 1997.408 ms
Time: 2069.913 ms
Time: 2004.196 ms

10
-----------------------------
Time: 2815.434 ms (00:02.815)
Time: 2862.589 ms (00:02.863)
Time: 2841.126 ms (00:02.841)
Time: 2803.040 ms (00:02.803)
Time: 2805.527 ms (00:02.806)

master
-----------------------------
Time: 3479.233 ms (00:03.479)
Time: 3537.901 ms (00:03.538)
Time: 3459.314 ms (00:03.459)
Time: 3542.810 ms (00:03.543)
Time: 3482.141 ms (00:03.482)

So there seems to be +40% between 9.6 and 10, and further +25% between
10 and master. However, plain hashagg, measured e.g. like this:

select count(*) cx from foo group by b, c, d having count(*) = 1;

does not indicate any slowdown at all, so I think you're right it has
something to do with the looping.

Profiles from those versions look like this:

9.6
---------------------------------------------------------
Samples
Overhead Shared Object Symbol
14.19% postgres [.] ExecMakeFunctionResultNoSets
13.65% postgres [.] finalize_aggregates
12.54% postgres [.] hash_seq_search
6.70% postgres [.] finalize_aggregate.isra.0
5.71% postgres [.] ExecEvalParamExec
5.54% postgres [.] ExecEvalAggref
5.00% postgres [.] ExecStoreMinimalTuple
4.34% postgres [.] ExecAgg
4.08% postgres [.] ExecQual
2.67% postgres [.] slot_deform_tuple
2.24% postgres [.] pgstat_init_function_usage
2.22% postgres [.] check_stack_depth
2.14% postgres [.] MemoryContextReset
1.89% postgres [.] hash_search_with_hash_value
1.72% postgres [.] project_aggregates
1.68% postgres [.] pgstat_end_function_usage
1.59% postgres [.] slot_getattr

10
------------------------------------------------------------
Samples
Overhead Shared Object Symbol
15.18% postgres [.] slot_deform_tuple
13.09% postgres [.] agg_retrieve_hash_table
12.02% postgres [.] ExecInterpExpr
7.47% postgres [.] finalize_aggregates
7.38% postgres [.] tuplehash_iterate
5.13% postgres [.] prepare_projection_slot
4.86% postgres [.] finalize_aggregate.isra.0
4.05% postgres [.] bms_is_member
3.97% postgres [.] slot_getallattrs
3.59% postgres [.] ExecStoreMinimalTuple
2.85% postgres [.] project_aggregates
1.95% postgres [.] ExecClearTuple
1.71% libc-2.30.so [.] __memset_avx2_unaligned_erms
1.69% postgres [.] ExecEvalParamExec
1.58% postgres [.] MemoryContextReset
1.17% postgres [.] slot_getattr
1.03% postgres [.] slot_getsomeattrs

master
--------------------------------------------------------------
Samples
Overhead Shared Object Symbol
17.07% postgres [.] agg_retrieve_hash_table
15.46% postgres [.] tuplehash_iterate
11.83% postgres [.] tts_minimal_getsomeattrs
9.39% postgres [.] ExecInterpExpr
6.94% postgres [.] prepare_projection_slot
4.85% postgres [.] finalize_aggregates
4.27% postgres [.] bms_is_member
3.80% postgres [.] finalize_aggregate.isra.0
3.80% postgres [.] tts_minimal_store_tuple
2.22% postgres [.] project_aggregates
2.07% postgres [.] tts_virtual_clear
2.07% postgres [.] MemoryContextReset
1.78% postgres [.] tts_minimal_clear
1.61% postgres [.] ExecEvalParamExec
1.46% postgres [.] slot_getsomeattrs_int
1.34% libc-2.30.so [.] __memset_avx2_unaligned_erms

Not sure what to think about this. Seems slot_deform_tuple got way more
expensive between 9.6 and 10, for some reason.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#3)
Re: significant slowdown of HashAggregate between 9.6 and 10

Hi,

On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:

So there seems to be +40% between 9.6 and 10, and further +25% between
10 and master. However, plain hashagg, measured e.g. like this:

Ugh.

Since I am a likely culprit, I'm taking a look.

Not sure what to think about this. Seems slot_deform_tuple got way more
expensive between 9.6 and 10, for some reason.

That might indicate too many calls instead. Or it could just be the fact
that expression evaluation now deforms all the needed columns at once,
instead of deforming columns one-by-one.
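
As a rough illustration of that difference, here is a standalone sketch with an invented fixed-width row layout; it is not the executor code, and slot_getattr()/slot_getsomeattrs() are only referenced in the comments as the analogous routines:

#include <stdio.h>

#define NATTS 6

typedef struct
{
    int raw[NATTS];         /* the "stored" representation */
} Row;

typedef struct
{
    const Row *row;
    int  values[NATTS];     /* deformed attribute cache */
    int  nvalid;            /* how many leading attributes are deformed */
} Slot;

/* Deform leading attributes up to attnum (inclusive), like slot_getsomeattrs(). */
static void
deform_upto(Slot *slot, int attnum)
{
    for (int i = slot->nvalid; i < attnum; i++)
        slot->values[i] = slot->row->raw[i];
    if (attnum > slot->nvalid)
        slot->nvalid = attnum;
}

/* Fetch one attribute, deforming lazily, like slot_getattr();
 * every call pays the "how far did we get?" bookkeeping. */
static int
getattr(Slot *slot, int attnum)
{
    deform_upto(slot, attnum);
    return slot->values[attnum - 1];
}

int
main(void)
{
    Row  row = {{10, 1, 2, 3, 40, 50}};

    /* Pre-10 style: one lazy fetch per referenced column (b, c, d). */
    Slot lazy = {&row, {0}, 0};
    int  sum_lazy = getattr(&lazy, 2) + getattr(&lazy, 3) + getattr(&lazy, 4);

    /* Expression-interpreter style: deform everything the expression
     * needs once, then read the cached values directly. */
    Slot eager = {&row, {0}, 0};
    deform_upto(&eager, 4);
    int  sum_eager = eager.values[1] + eager.values[2] + eager.values[3];

    printf("%d %d\n", sum_lazy, sum_eager);     /* prints "6 6" */
    return 0;
}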

Greetings,

Andres Freund

#5 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: significant slowdown of HashAggregate between 9.6 and 10

Hi,

On 2020-06-03 13:26:43 -0700, Andres Freund wrote:

On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:

So there seems to be +40% between 9.6 and 10, and further +25% between
10 and master. However, plain hashagg, measured e.g. like this:

As far as I can tell the 10->master difference comes largely from the
difference in the number of buckets in the hashtable.

In 10 it is:
Breakpoint 1, tuplehash_create (ctx=0x5628251775c8, nelements=75, private_data=0x5628251952f0)
and in master it is:
Breakpoint 1, tuplehash_create (ctx=0x5628293a0a70, nelements=256, private_data=0x5628293a0b90)

As far as I can tell the timing difference simply is the cost of
iterating 500k times over a hashtable with fairly few entries. Which is,
unsurprisingly, more expensive if the hashtable is larger.
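
Roughly, the effect can be modeled like this (a standalone sketch, not PostgreSQL code; 128 is only my guess at what simplehash would pick for 75 entries, and entry placement is simplified):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NENTRIES 75          /* groups produced by the subplan */
#define NSCANS   500000      /* times the subplan gets re-scanned */

typedef struct { bool used; int value; } Bucket;

/* A full scan must visit every bucket, occupied or not, as a
 * simplehash-style iterator does. */
static long
scan_all(const Bucket *tab, int nbuckets, long *buckets_visited)
{
    long sum = 0;

    for (int i = 0; i < nbuckets; i++)
    {
        (*buckets_visited)++;
        if (tab[i].used)
            sum += tab[i].value;
    }
    return sum;
}

static void
run(int nbuckets)
{
    Bucket *tab = calloc(nbuckets, sizeof(Bucket));
    long    visited = 0, found = 0;

    for (int i = 0; i < NENTRIES; i++)
    {
        tab[i].used = true;
        tab[i].value = 1;
    }
    for (int s = 0; s < NSCANS; s++)
        found += scan_all(tab, nbuckets, &visited);

    printf("%4d buckets: visited %ld buckets to return %ld entries\n",
           nbuckets, visited, found);
    free(tab);
}

int
main(void)
{
    run(128);   /* roughly what a 75-group estimate would get */
    run(256);   /* with a 256-bucket floor */
    return 0;
}

The work per rescan tracks the bucket count, so the 256-bucket floor roughly doubles it for this query even though the entry count is identical.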

The reason the hashtable got bigger in 13 is

commit 1f39bce021540fde00990af55b4432c55ef4b3c7
Author: Jeff Davis <jdavis@postgresql.org>
Date: 2020-03-18 15:42:02 -0700

Disk-based Hash Aggregation.

which introduced

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256

I don't really see much explanation for that part in the commit, perhaps
Jeff can chime in?

I think optimizing for the gazillion hash table scans isn't particularly
important. Rarely is a query going to have 500k scans of unchanging
aggregated data. So I'm not too concerned about the 13 regression - but
I also see very little reason to just always use 256 buckets? It's
pretty darn common to end up with 1-2 groups, what's the point of this?
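
For illustration, the floor effectively behaves like this (a hedged sketch of its effect, not the actual committed code):

#include <stdio.h>

#define HASHAGG_MIN_BUCKETS 256

/* Sketch only: clamp the planner's group estimate to the floor. */
static long
initial_nbuckets(long estimated_ngroups)
{
    long nbuckets = estimated_ngroups;

    if (nbuckets < HASHAGG_MIN_BUCKETS)
        nbuckets = HASHAGG_MIN_BUCKETS;
    return nbuckets;
}

int
main(void)
{
    /* Even a 2-group aggregate ends up iterating a 256-bucket table. */
    printf("%ld\n", initial_nbuckets(2));      /* 256 */
    printf("%ld\n", initial_nbuckets(75));     /* 256 */
    return 0;
}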

I'll look into 9.6->10 after buying groceries... But I wish there were
a relevant benchmark; I don't think it's worth optimizing for this case.

Greetings,

Andres Freund

#6 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#5)
Re: significant slowdown of HashAggregate between 9.6 and 10

On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256

I don't really see much explanation for that part in the commit,
perhaps
Jeff can chime in?

I did this in response to a review comment (point #5):

/messages/by-id/20200219191636.gvdywx32kwbix6kd@development

Tomas suggested a min of 1024, and I thought I was being more
conservative choosing 256. Still too high, I guess?

I can lower it. What do you think is a reasonable minimum?

Regards,
Jeff Davis

#7 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#6)
Re: significant slowdown of HashAggregate between 9.6 and 10

Hi,

On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:

On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256

I don't really see much explanation for that part in the commit,
perhaps
Jeff can chime in?

I did this in response to a review comment (point #5):

/messages/by-id/20200219191636.gvdywx32kwbix6kd@development

Tomas suggested a min of 1024, and I thought I was being more
conservative choosing 256. Still too high, I guess?

I can lower it. What do you think is a reasonable minimum?

I don't really see why there needs to be a minimum bigger than 1?

Greetings,

Andres Freund

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: significant slowdown of HashAggregate between 9.6 and 10

On Thu, Jun 04, 2020 at 06:57:58PM -0700, Andres Freund wrote:

Hi,

On 2020-06-04 18:22:03 -0700, Jeff Davis wrote:

On Thu, 2020-06-04 at 11:41 -0700, Andres Freund wrote:

+/* minimum number of initial hash table buckets */
+#define HASHAGG_MIN_BUCKETS 256

I don't really see much explanation for that part in the commit,
perhaps
Jeff can chime in?

I did this in response to a review comment (point #5):

/messages/by-id/20200219191636.gvdywx32kwbix6kd@development

Tomas suggested a min of 1024, and I thought I was being more
conservative choosing 256. Still too high, I guess?

I can lower it. What do you think is a reasonable minimum?

I don't really see why there needs to be a minimum bigger than 1?

I think you're right. I think I was worried about having to resize the
hash table in case of an under-estimate, and it seemed fine to waste a
tiny bit more memory to prevent that. But this example shows we may need
to scan the hash table sequentially, which means it's not just about
memory consumption. So in hindsight we either don't need the limit at
all, or maybe it could be much lower (IIRC it reduces probability of
collision, but maybe dynahash does that anyway internally).

I wonder if hashjoin has the same issue, but probably not - I don't
think we'll ever scan that internal hash table sequentially.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#8)
Re: significant slowdown of HashAggregate between 9.6 and 10

Hi,

On 2020-06-05 15:25:26 +0200, Tomas Vondra wrote:

I think you're right. I think I was worried about having to resize the
hash table in case of an under-estimate, and it seemed fine to waste a
tiny bit more memory to prevent that.

It's pretty cheap to resize a hashtable with a handful of entries, so I'm not
worried about that. It's also how it has worked for a *long* time, so I think
unless we have some good reason to change that, I wouldn't.

But this example shows we may need to scan the hash table
sequentially, which means it's not just about memory consumption.

We *always* scan the hashtable sequentially, no? Otherwise there's no way to
get at the aggregated data.

So in hindsight we either don't need the limit at all, or maybe it
could be much lower (IIRC it reduces probability of collision, but
maybe dynahash does that anyway internally).

This is code using simplehash, which resizes at a load factor of 0.9.
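
A sizing rule of that shape looks roughly like this (a standalone sketch modeled on that behavior, not copied from simplehash.h):

#include <stdint.h>
#include <stdio.h>

#define FILLFACTOR 0.9

/* Round up to the next power of two, minimum 2. */
static uint64_t
next_pow2(uint64_t n)
{
    uint64_t size = 2;

    while (size < n)
        size *= 2;
    return size;
}

/* Pick an initial bucket count for an expected number of entries,
 * keeping the expected load under the fill factor. */
static uint64_t
choose_size(uint64_t nelements)
{
    return next_pow2((uint64_t) (nelements / FILLFACTOR));
}

/* Would inserting one more entry exceed the load factor? */
static int
needs_grow(uint64_t members, uint64_t size)
{
    return members + 1 > (uint64_t) (size * FILLFACTOR);
}

int
main(void)
{
    printf("75 expected entries -> %lu buckets\n", (unsigned long) choose_size(75));
    printf("1 expected entry    -> %lu buckets\n", (unsigned long) choose_size(1));
    printf("grow at 15/16 full: %d\n", needs_grow(14, 16));
    return 0;
}

So a tiny requested size stays tiny, and an under-estimate just triggers a cheap doubling once the 0.9 threshold is crossed.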

I wonder if hashjoin has the same issue, but probably not - I don't
think we'll ever scan that internal hash table sequentially.

I think we do for some outer joins (c.f. ExecPrepHashTableForUnmatched()), but
it's probably not relevant performance-wise.

Greetings,

Andres Freund