Macro customizable hashtable / bitmapscan & aggregation perf

Started by Andres Freund over 9 years ago. 23 messages. pgsql-hackers.
#1 Andres Freund
andres@anarazel.de

Hi,

As previously mentioned, dynahash isn't particularly fast. In many cases
that doesn't particularly matter. But e.g. hash aggregation and bitmap
index scans are quite sensitive to hash performance.

The biggest problems with dynahash (also discussed at [1]) are

a) two level structure doubling the amount of indirect lookups
b) indirect function calls
c) using separate chaining based conflict resolution
d) being too general.

In the post referenced above I'd implemented an open-coded hashtable
addressing these issues for the tidbitmap.c case, with quite some
success.

It's not particularly desirable to have various slightly differing
hash-table implementations across the backend though. The only solution
I could think of, that preserves the efficiency, is to have a hash-table
implementation which is customizable to the appropriate types etc., via
macros.

Attached is simplehash.h, which can be
customized by a bunch of macros, before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

To show the motivation:
Bitmapscan:
before:
tpch[22005]=# EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=942330.27..942330.28 rows=1 width=8) (actual time=5283.026..5283.026 rows=1 loops=1) │
│ -> Bitmap Heap Scan on lineitem (cost=193670.02..919511.38 rows=9127557 width=8) (actual time=3041.072..4436.569 rows=9113219 loops=1) │
│ Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Heap Blocks: exact=585542 │
│ -> Bitmap Index Scan on i_l_shipdate (cost=0.00..191388.13 rows=9127557 width=0) (actual time=2807.246..2807.246 rows=9113219 loops=1) │
│ Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Planning time: 0.077 ms │
│ Execution time: 5284.038 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

after:
tpch[21953]=# EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=942330.27..942330.28 rows=1 width=8) (actual time=3431.602..3431.602 rows=1 loops=1) │
│ -> Bitmap Heap Scan on lineitem (cost=193670.02..919511.38 rows=9127557 width=8) (actual time=1158.783..2588.357 rows=9113219 loops=1) │
│ Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Heap Blocks: exact=585542 │
│ -> Bitmap Index Scan on i_l_shipdate (cost=0.00..191388.13 rows=9127557 width=0) (actual time=958.341..958.341 rows=9113219 loops=1) │
│ Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Planning time: 0.070 ms │
│ Execution time: 3435.259 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

Hash-Agg:

before:

tpch[22005]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT l_partkey, SUM(l_extendedprice) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=1069653.05..1069653.08 rows=10 width=12) (actual rows=10 loops=1) │
│ -> Sort (cost=1069653.05..1072083.33 rows=972112 width=12) (actual rows=10 loops=1) │
│ Sort Key: (sum(l_extendedprice)) DESC │
│ Sort Method: top-N heapsort Memory: 25kB │
│ -> HashAggregate (cost=1038924.94..1048646.06 rows=972112 width=12) (actual rows=1000000 loops=1) │
│ Group Key: l_partkey │
│ -> Seq Scan on lineitem (cost=0.00..888925.96 rows=29999796 width=12) (actual rows=29999795 loops=1) │
│ Planning time: 0.210 ms │
│ Execution time: 20887.906 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

after:

tpch[22342]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT l_partkey, SUM(l_extendedprice) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=1069653.05..1069653.08 rows=10 width=12) (actual rows=10 loops=1) │
│ -> Sort (cost=1069653.05..1072083.33 rows=972112 width=12) (actual rows=10 loops=1) │
│ Sort Key: (sum(l_extendedprice)) DESC │
│ Sort Method: top-N heapsort Memory: 25kB │
│ -> HashAggregate (cost=1038924.94..1048646.06 rows=972112 width=12) (actual rows=1000000 loops=1) │
│ Group Key: l_partkey │
│ -> Seq Scan on lineitem (cost=0.00..888925.96 rows=29999796 width=12) (actual rows=29999795 loops=1) │
│ Planning time: 0.104 ms │
│ Execution time: 14617.481 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

The hash-table performance difference itself is bigger in both cases,
but other things, notably slot deforming and expression evaluation,
become the bottleneck.

Does anybody have a better idea how to approach the hash-table problem?
While it'd be nice not to resort to such macro-magic, I can't think of
an alternative short of switching to C++ and using templates. Currently
this is used like:

#define SH_PREFIX pagetable
#define SH_KEYTYPE BlockNumber
#define SH_KEY blockno
#define SH_CONTAINS PagetableEntry
#define SH_HASH_KEY(tb, key) hash_blockno(key)
#define SH_EQUAL(tb, a, b) a == b
#define SH_SCOPE static
#define SH_DEFINE
#define SH_DECLARE
#include "lib/simplehash.h"

which then creates functions like pagetable_create/insert/lookup/delete/...
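For readers unfamiliar with this style of C templating, the underlying mechanism is token pasting: the header glues SH_PREFIX onto every identifier it emits. Here is a toy sketch of the mechanics (invented names, not the actual simplehash.h contents):

```c
#include <stdint.h>

/* Two-level paste so SH_PREFIX is expanded before ## glues the tokens. */
#define SH_PASTE2(a, b) a##_##b
#define SH_PASTE(a, b) SH_PASTE2(a, b)
#define SH_MAKE_NAME(name) SH_PASTE(SH_PREFIX, name)

/* What the template header would emit for one operation: a function whose
 * name, key type, and hash expression are all fixed at compile time, so
 * the compiler can inline and specialize everything. */
#define SH_DEFINE_HASH                                        \
    static uint32_t SH_MAKE_NAME(hash)(SH_KEYTYPE key)        \
    {                                                         \
        return SH_HASH_KEY(key);                              \
    }

/* One instantiation, analogous to the pagetable example above. */
#define SH_PREFIX pagetable
#define SH_KEYTYPE uint32_t
#define SH_HASH_KEY(key) ((key) * 2654435761u)  /* Knuth multiplicative hash */
SH_DEFINE_HASH       /* emits: static uint32_t pagetable_hash(uint32_t) */
```

A second inclusion with a different SH_PREFIX (after #undef-ing the parameter macros) emits an independent, fully typed copy; that is the whole trick behind including "lib/simplehash.h" repeatedly.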

I strongly suspect there are some other cases that could benefit from
another hash-table implementation. Particularly catcache.c seems like a
candidate (instead of its hand-rolled stuff, which frequently shows up
in profiles).

Note that these patches are *NOT* in a shape for in-detail review. I'd
first like to know what people think about the general approach, and
about better alternatives.

Also note that some queries in the regression test change their result,
because the ordering is unspecified...

Greetings,

Andres Freund

[1]: http://archives.postgresql.org/message-id/20160714030616.dvlkboz6cw555ixb%40alap3.anarazel.de

Attachments:

0001-WIP-Add-likely-unlikely-macros.patch (text/x-patch, +9/-1)
0002-WIP-Add-an-inline-macro-customizable-hashtable.patch (text/x-patch, +407/-1)
0003-WIP-Use-more-efficient-hashtable-for-tidbitmap.c.patch (text/x-patch, +59/-51)
0004-WIP-execGrouping.c-simplehash.patch (text/x-patch, +124/-136)
#2 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund <andres@anarazel.de> wrote:

As previously mentioned, dynahash isn't particularly fast. In many cases
that doesn't particularly matter. But e.g. hash aggregation and bitmap
index scans are quite sensitive to hash performance.

The biggest problems with dynahash (also discussed at [1]) are

a) two level structure doubling the amount of indirect lookups
b) indirect function calls
c) using separate chaining based conflict resolution
d) being too general.

I am ... skeptical about open addressing. It's less unappealing for
this application than in general because we don't actually need to
delete anything, but that wouldn't be true for the catcache. All the
same, I feel that we need to assess the risk that we're improving
average-case performance while creating large regressions in other
cases (i.e. where there is clustering).

Do likely() and unlikely() actually buy us anything here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 2016-07-27 10:04:52 -0400, Robert Haas wrote:

On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund <andres@anarazel.de> wrote:

As previously mentioned, dynahash isn't particularly fast. In many cases
that doesn't particularly matter. But e.g. hash aggregation and bitmap
index scans are quite sensitive to hash performance.

The biggest problems with dynahash (also discussed at [1]) are

a) two level structure doubling the amount of indirect lookups
b) indirect function calls
c) using separate chaining based conflict resolution
d) being too general.

I am ... skeptical about open addressing. It's less unappealing for
this application than in general because we don't actually need to
delete anything, but that wouldn't be true for the catcache.

I don't think deletions really change the picture for open addressing -
you don't need tombstones; you can instead move elements closer to
their origin at deletion. I just hadn't implemented that yet, because it
didn't seem like a crucial point.
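The tombstone-free deletion described here is commonly implemented as "backward shift": on delete, walk the probe chain and pull back any entry that would otherwise become unreachable. A minimal toy sketch of the idea (not the patch's code; fixed-size table, bare keys, 0 doubling as the empty marker):

```c
#include <stdint.h>

#define CAP 16                    /* power of two, toy-sized */
static uint32_t slots[CAP];       /* 0 doubles as the "empty" marker here */

static uint32_t hashfn(uint32_t k) { return k * 2654435761u; }

static void sh_insert(uint32_t key)
{
    uint32_t i = hashfn(key) & (CAP - 1);
    while (slots[i] != 0 && slots[i] != key)
        i = (i + 1) & (CAP - 1);
    slots[i] = key;
}

static int sh_lookup(uint32_t key)
{
    uint32_t i = hashfn(key) & (CAP - 1);
    while (slots[i] != 0)
    {
        if (slots[i] == key)
            return 1;
        i = (i + 1) & (CAP - 1);
    }
    return 0;
}

/* Tombstone-free delete: empty the slot, then walk the probe chain and
 * pull back any entry whose home slot lies cyclically at or before the
 * hole, so every remaining entry stays reachable from its home. */
static void sh_delete(uint32_t key)
{
    uint32_t i = hashfn(key) & (CAP - 1);
    while (slots[i] != 0 && slots[i] != key)
        i = (i + 1) & (CAP - 1);
    if (slots[i] == 0)
        return;                   /* not present */
    for (;;)
    {
        uint32_t j = i;
        slots[i] = 0;
        for (;;)
        {
            uint32_t home;
            j = (j + 1) & (CAP - 1);
            if (slots[j] == 0)
                return;           /* chain ends: the hole is harmless */
            home = hashfn(slots[j]) & (CAP - 1);
            /* skip entries still reachable with slot i empty, i.e. whose
             * home lies cyclically within (i, j] */
            if ((i < j) ? (home > i && home <= j)
                        : (home > i || home <= j))
                continue;
            break;
        }
        slots[i] = slots[j];      /* move it back toward its origin */
        i = j;                    /* and fill the new hole in turn */
    }
}
```

The wraparound-aware condition is the classic open-addressing removal test: an entry may fill the hole only if its home slot does not lie cyclically between the hole and the entry's current position.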

Unfortunately open addressing, particularly linear probing, has such
vastly superior cache access patterns that it's hard to come close with
a chained hash table. I've previously tried a chained hash table, and
the performance gains were a lot smaller. For that reason many
hash-table implementations used in performance-critical paths appear to
be open addressing.

Don't get me wrong, I do certainly think that there's good cases for
using separate chaining in places where performance is less of a
concern; and possibly when you want lock-free concurrency.

All the same, I feel that we need to assess the risk that we're
improving average-case performance while creating large regressions in
other cases (i.e. where there is clustering).

The actual algorithmic worst-case complexity isn't different. It's
obviously important to resize appropriately. It's not that hard to beat
dynahash's memory efficiency, so fillfactors don't have to be kept super
high to not regress in memory usage.

Do likely() and unlikely() actually buy us anything here?

It's only a few percent here, but overall, especially when used around
error checks, it's above 10%. The reason I used it here is primarily
that it made the assembly easier to read for me... ;)
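For reference, likely()/unlikely() macros of this kind are conventionally built on GCC's __builtin_expect, roughly as below (a sketch of the common idiom; the 0001 patch's exact definitions may differ):

```c
/* Branch-prediction hints: tell the compiler which way a condition
 * usually goes, so it can lay out the hot path without taken branches.
 * Falls back to a plain test on compilers without __builtin_expect. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   ((x) != 0)
#define unlikely(x) ((x) != 0)
#endif

/* typical use: annotate a rarely-taken error path (hypothetical helper) */
static int checked_div(int a, int b, int *result)
{
    if (unlikely(b == 0))
        return -1;              /* cold path */
    *result = a / b;
    return 0;
}
```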

Greetings,

Andres Freund


#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

Attached is a significantly improved version of this series. The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
addressing.
- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.
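Since the commit message of 0002 isn't reproduced here, a quick primer: robin-hood collision handling lets an inserted key "steal" a slot from any resident entry that sits closer to its preferred slot, displacing the resident onward; this evens out probe lengths under linear probing. A toy sketch of that discipline (illustrative only, not the 0002 code; 0 serves as the empty marker):

```c
#include <stdint.h>

#define CAP 16                 /* toy-sized, power of two */
static uint32_t rh[CAP];       /* 0 doubles as the "empty" marker */

static uint32_t rh_hash(uint32_t k) { return k * 2654435761u; }

/* how far the given key would be from its preferred slot if stored at i */
static uint32_t rh_dist(uint32_t key, uint32_t i)
{
    return (i - (rh_hash(key) & (CAP - 1))) & (CAP - 1);
}

/* Robin-hood insertion: if the resident entry is closer to its home than
 * the probing one, the probing one steals the slot and the resident is
 * displaced onward.  This caps the variance of probe lengths. */
static void rh_insert(uint32_t key)
{
    uint32_t i = rh_hash(key) & (CAP - 1);

    for (;;)
    {
        if (rh[i] == 0)
        {
            rh[i] = key;
            return;
        }
        if (rh[i] == key)
            return;                 /* already present */
        if (rh_dist(rh[i], i) < rh_dist(key, i))
        {
            uint32_t displaced = rh[i];
            rh[i] = key;
            key = displaced;        /* keep inserting the displaced entry */
        }
        i = (i + 1) & (CAP - 1);
    }
}

static int rh_lookup(uint32_t key)
{
    uint32_t i = rh_hash(key) & (CAP - 1);

    while (rh[i] != 0)
    {
        if (rh[i] == key)
            return 1;
        /* invariant: entries sit in displacement order, so once we see
         * one closer to its home than we are, the key cannot follow */
        if (rh_dist(rh[i], i) < rh_dist(key, i))
            return 0;
        i = (i + 1) & (CAP - 1);
    }
    return 0;
}
```

The displacement invariant also lets misses terminate early, without scanning to the next empty slot.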

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.

This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in [1], to improve performance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates

While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.

I think there are several possible additional users of simplehash.h,
e.g. catcache.c - which has an open coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.

Comments?

Greetings,

Andres Freund

[1]: http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de

Attachments:

0001-Add-likely-unlikely-branch-hint-macros.patch (text/x-patch, +16/-1)
0002-Add-a-macro-customizable-hashtable.patch (text/x-patch, +798/-5)
0003-Use-more-efficient-hashtable-for-tidbitmap.c-to-spee.patch (text/x-patch, +69/-61)
0004-Use-more-efficient-hashtable-for-execGrouping.c-to-s.patch (text/x-patch, +147/-208)
#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#4)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 10/01/2016 02:44 AM, Andres Freund wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

Attached is a significantly improved version of this series. The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
addressing.

Interesting. Have you looked at cuckoo hashing? That seems to be the
go-to hashing in several database-related papers I've read recently, so
I guess there's a reason for that (IIRC it has constant worst case for
lookups, constant amortized time for inserts/deletes). Not sure how it
compares to robin-hood hashing regarding cache-friendliness though.

- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.

Well, I have rather bad experience with running benchmarks on laptops
anyway - a lot of noise due to power management, etc. What about running
a bigger benchmark - say, TPC-DS - and evaluating the impact?

I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places switched
to the new implementation (e.g. execGrouping merges the duplicates to a
single group, by definition).

This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in [1], to improve performance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates

While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.

So, is it the right time to do some benchmarking?

I think there's several possible additional users of simplehash.h,
e.g. catcache.c - which has an open coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#6 Greg Stark
In reply to: Andres Freund (#4)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund <andres@anarazel.de> wrote:

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.

I have a machine sitting idle now too if you have specific ideas of
what to benchmark.

--
greg


#7 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#5)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:

On 10/01/2016 02:44 AM, Andres Freund wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

Attached is a significantly improved version of this series. The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
addressing.

Interesting. Have you looked at cuckoo hashing?

Yes. I don't think it's a good fit for modern CPUs: because it doesn't
probe linearly, you constantly incur random uncached memory accesses.
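To illustrate the access-pattern point: a cuckoo table resolves every key to two unrelated buckets, so each lookup touches up to two random cache lines, whereas linear probing's subsequent probes hit adjacent, usually already-fetched lines. A toy sketch (hypothetical hash functions, 0 as the empty marker):

```c
#include <stdint.h>

#define CCAP 16
static uint32_t t1[CCAP], t2[CCAP];   /* 0 == empty in this toy */

/* two independent (hypothetical) hash functions -> two unrelated buckets */
static uint32_t ck_h1(uint32_t k) { return k & (CCAP - 1); }
static uint32_t ck_h2(uint32_t k) { return (k * 2654435761u >> 16) & (CCAP - 1); }

/* Lookup is at most two probes -- but at two unrelated locations, so a
 * cold lookup costs up to two random memory accesses. */
static int ck_lookup(uint32_t key)
{
    return t1[ck_h1(key)] == key || t2[ck_h2(key)] == key;
}

/* Insert by eviction: kick the resident to its alternate table, repeat.
 * A real implementation rehashes when the eviction chain gets too long;
 * duplicate-key handling is omitted for brevity. */
static int ck_insert(uint32_t key)
{
    for (int n = 0; n < 32; n++)
    {
        uint32_t evicted = t1[ck_h1(key)];
        t1[ck_h1(key)] = key;
        if (evicted == 0)
            return 1;
        key = evicted;
        evicted = t2[ck_h2(key)];
        t2[ck_h2(key)] = key;
        if (evicted == 0)
            return 1;
        key = evicted;
    }
    return 0;                      /* give up; would trigger a rehash */
}
```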

I've played with a few schemes, and they all seemed to be slower than
linear-probing variants, unless you intentionally used a bad
hash distribution, or intentionally "unbalanced" the hash-space by
iterating over either end of the keyspace and moving the entries around.

- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.

Well, I have rather bad experience with running benchmarks on laptops anyway
- a lot of noise due to power management, etc.

Well, with factor ~2x improvements that's not really a big
detractor. Using a new CPU makes some forms of analysis easier (better
PMUs).

For longrunning tests I agree.

What about running a bigger benchmark - say, TPC-DS - and evaluating
the impact?

Worthwhile, although obviously the impact will be a lot smaller, since
they're not entirely bottlenecked on hash-aggs and bitmap scans.

I think a crucial part of the benchmarking will be identifying and measuring
corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group. I don't really
see any case where repeated keys should cause an issue for a hash table?

This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in [1], to improve performance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates

While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.

So, is it the right time to do some benchmarking?

That's one thing that's required, yes. The other is whether we can live
with the uglyness of implementing templates via macros. I do think we
can, but...

Greetings,

Andres Freund


#8 Andres Freund
andres@anarazel.de
In reply to: Greg Stark (#6)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 2016-10-01 20:04:18 +0100, Greg Stark wrote:

On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund <andres@anarazel.de> wrote:

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.

I have a machine sitting idle now too if you have specific ideas of
what to benchmark.

Last time I just extracted interesting parts of queries from tpc-h (to
be able to look at bitmapscans/hash-aggs in isolation) and ran tpc-h as
a whole. During development I also tried to come up with adverse cases
(e.g. huge bitmapscans, with very low work_mem). I don't really have a
lot better ideas than that.

Andres


#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 10/01/2016 09:59 PM, Andres Freund wrote:

Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:

On 10/01/2016 02:44 AM, Andres Freund wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

Attached is a significantly improved version of this series. The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
addressing.

Interesting. Have you looked at cuckoo hashing?

Yes. I don't think it's a good fit for modern CPUs. Because it
doesn't probe linearly you constantly have random uncached memory
accesses.

I've played with a few schemes, and they all seemed to be slower
than linear probing deviations, unless you intentionally used a
wrong hash-distribution, or intentionally "unbalanced" the hash-space
by iterating over either end of the keyspace and moved the entries
around.

OK, understood.

- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.

Well, I have rather bad experience with running benchmarks on laptops anyway
- a lot of noise due to power management, etc.

Well, with factor ~2x improvements that's not really a big detractor.
Using a new CPU makes some forms of analysis easier (better PMUs).

True.

For longrunning tests I agree.

What about running a bigger benchmark - say, TPC-DS - and evaluating
the impact?

Worthwhile, although obviously the impact will be a lot smaller,
since they're not entirely bottlenecked on hash-aggs and bitmap
scans.

Sure, the improvement won't be as significant as on the simple queries,
but it's interesting IMHO.

I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group. I don't really
see any case where repeated keys should cause an issue for a hash table?

Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking
about hashjoins, but now that I think of it - that's a fairly
specialized and tuned code. In any case, let's not complicate the
discussion for now.

This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in [1], to improve performance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates

While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.

So, is it the right time to do some benchmarking?

That's one thing that's required, yes. The other is whether we can live
with the uglyness of implementing templates via macros. I do think we
can, but...

Hmmm ... not sure. If one of the points is to get rid of function calls
determined at runtime (which make it impossible for the compiler to
optimize the code), then I can think of three options:

(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT
anywhere at this point. So while the macro-based templates look a bit
awkward, I'm not aware of a better C thing.

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

/* round up size to the next power of 2, eases lookups */
if (size < 2)
size = 2;
else
size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?
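For reference, the power-of-2 rounding that sh_pow2_int presumably performs can be done with the standard bit-smearing trick (a sketch, not necessarily the patch's exact code):

```c
#include <stdint.h>

/* Round up to the next power of 2 by smearing the highest set bit
 * rightwards, then adding one. */
static uint64_t sh_pow2(uint64_t num)
{
    uint64_t v = num - 1;       /* so exact powers of 2 map to themselves */

    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return v + 1;
}
```

With the size < 2 special case quoted above, callers never hit the num == 0 edge (where this sketch would return 0).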

2) tb->resize_above

I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined
as a constant somewhere (not sure if it makes sense to make it part of
SH_TYPE, so that hash tables may use different load factors).

3) 'size' is a rather poor name for parameter (e.g. in SH_CREATE or
SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
size in bytes' or 'number of entries'.

4) SH_CONTAINS sounds more like a function checking whether a hash table
contains a key, not as a 'type of hash table entry'. What about
SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

5) SH_RESIZE

Do we expect to shrink hash tables? If not, SH_RESIZE should probably
check that newsize > oldsize. If yes, it should check that there's
enough space for all entries (with the load factor).

It's not entirely clear why it is guaranteed that there's always an
element at its optimal position (when load factor < 1). Adding an
explanation or a link to a paper would be helpful, I guess.

6) SH_STAT

This bit is a bit suspicious:

uint32 max_collisions = 0;

...

/* single contained element is not a collision */
curcoll--;
total_collisions += curcoll;
if (curcoll > max_collisions)
max_collisions = curcoll - 1;

Shouldn't the last line be just "max_collisions = curcoll"?

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the
allocated memory, even if there's free space in the hash table. After
all, that's what hashjoins do.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#10 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#9)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

Hi,

On 2016-10-02 01:37:35 +0200, Tomas Vondra wrote:

On 10/01/2016 09:59 PM, Andres Freund wrote:

What about running a bigger benchmark - say, TPC-DS - and evaluating
the impact?

Worthwhile, although obviously the impact will be a lot smaller,
since they're not entirely bottlenecked on hash-aggs and bitmap
scans.

Sure, the improvement won't be as significant as on the simple queries, but
it's interesting IMHO.

Agreed.

I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group. I don't really
see any case where repeated keys should cause an issue for a hash table?

Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking about
hashjoins, but now that I think of it - that's a fairly specialized and
tuned code. In any case, let's not complicate the discussion for now.

My point is that I don't really see any potential usecase where that's
an issue.

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

/* round up size to the next power of 2, eases lookups */
if (size < 2)
size = 2;
else
size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?

For some cases that'll waste a good bit of space. Consider
e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
the individual uses.

2) tb->resize_above

I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined as
a constant somewhere (not sure if it makes sense to make it part of SH_TYPE,
so that hash tables may use different load factors).

Fair point.

3) 'size' is a rather poor name for parameter (e.g. in SH_CREATE or
SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
size in bytes' or 'number of entries'.

Ok.

4) SH_CONTAINS sounds more like a function checking whether a hash table
contains a key, not as a 'type of hash table entry'. What about
SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

Hm, ok.

5) SH_RESIZE

Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.

It's not entirely clear why it's guaranteed that there's always an element
at its optimal position (when load factor < 1). Adding an explanation or a
link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be an
empty bucket. Every bucket after an empty bucket has to be at its
optimal position.

6) SH_STAT

This bit is a bit suspicious:

uint32 max_collisions = 0;

...

/* single contained element is not a collision */
curcoll--;
total_collisions += curcoll;
if (curcoll > max_collisions)
max_collisions = curcoll - 1;

Shouldn't the last line be just "max_collisions = curcoll"?

Hm. That's probably a refactoring artefact. Nice catch.

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the
allocated memory, even if there's free space in the hash table. After all,
that's what hashjoins do.

We didn't really for dynahash (as it basically assumed a fillfactor of 1
all the time), not sure if changing it won't also cause issues.

Thanks!

Greetings,

Andres Freund


#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#10)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 10/02/2016 02:17 AM, Andres Freund wrote:

Hi,
...

I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group. I don't really
see any case where repeated keys should cause an issue for a hash table?

Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking about
hashjoins, but now that I think of it - that's a fairly specialized and
tuned code. In any case, let's not complicate the discussion for now.

My point is that I don't really see any potential usecase where
that's an issue.

OK, I think we agree the two places modified by the submitted patch
series are fine. Let's leave discussion about places modified by future
patches for the future ;-)

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

/* round up size to the next power of 2, eases lookups */
if (size < 2)
size = 2;
else
size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?

For some cases that'll waste a good bit of space. Consider
e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
the individual uses.

Hmmm, OK. I have my doubts about those hardcoded constants, but you're
right, 256 is probably overkill.

5) SH_RESIZE

Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.

OK, sure. Then what about adding this to SH_RESIZE?

Assert(oldsize <= newsize);

I assumed we might shrink the catcache after invalidation, for example
(but maybe I don't really know how that works).

It's not entirely clear why it's guaranteed that there's always
an element at its optimal position (when load factor < 1). Adding an
explanation or a link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be
an empty bucket. Every bucket after an empty bucket has to be at
its optimal position.

Hmmm, thanks to moving the entries after delete? Adding an assert()
enforcing this seems like a good idea.

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the
allocated memory, even if there's free space in the hash table. After all,
that's what hashjoins do.

We didn't really for dynahash (as it basically assumed a fillfactor of 1
all the time), not sure if changing it won't also cause issues.

That's a case of glass half-full vs. half-empty, I guess. If we assumed
load factor 1.0, then I see it as accounting for the load factor (while
you see it as not accounting for it).

I don't see why this should cause issues - of course, the hash table may
be a bit larger, exceed work_mem and make the tidbitmap lossy, or switch
HashAggregate to GroupAggregate. But I don't think that's a major issue,
as those decisions depend on estimates anyway, so we can't really
guarantee them.

Also, with load factor 0.8 the table is only ~20% larger, so even if we
don't include that into work_mem it's a reasonably small error (easily
smaller than errors in the pre-9.5 HashJoin accounting, which did not
include chunk headers etc.).

So I won't fight for this, but I don't see why not to account for it.

regards


#12Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#11)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 2016-10-02 02:59:04 +0200, Tomas Vondra wrote:

On 10/02/2016 02:17 AM, Andres Freund wrote:

Hi,
...

I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group. I don't really
see any case where repeated keys should cause an issue for a hash table?

Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking about
hashjoins, but now that I think of it - that's a fairly specialized and
tuned code. In any case, let's not complicate the discussion for now.

My point is that I don't really see any potential usecase where
that's an issue.

OK, I think we agree the two places modified by the submitted patch series
are fine. Let's leave discussion about places modified by future patches for
the future ;-)

I think my problem here is that I just can't see a potential use-case
for hashtables where that's an issue ;)

5) SH_RESIZE

Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.

OK, sure. Then what about adding this to SH_RESIZE?

Assert(oldsize <= newsize);

Yes. And potentially rename it to SH_GROW.

I assumed we might shrink the catcache after invalidation, for example (but
maybe I don't really know how that works).

They don't shrink so far, and in most invalidation cases we don't delete
entries, just mark them as invalid (as they might be referenced on the
outside at that moment).

It's not entirely clear why it's guaranteed that there's always
an element at its optimal position (when load factor < 1). Adding an
explanation or a link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be
an empty bucket. Every bucket after an empty bucket has to be at
its optimal position.

Hmmm, thanks to moving the entries after delete?

Pretty much, yes.

Adding an assert() enforcing this seems like a good idea.

Probably requires some additional state to be meaningfully enforced. Not
sure that's worth the effort. If that invariant is broken, not a lot of
stuff works (believe me, I have broken it ;)): searches won't
necessarily find existing elements anymore if those didn't end up in
their original position (as the cell would now be empty, a
lookup/insert/delete would not find the existing element).

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the
allocated memory, even if there's free space in the hash table. After all,
that's what hashjoins do.

We didn't really for dynahash (as it basically assumed a fillfactor of 1
all the time), not sure if changing it won't also cause issues.

That's a case of glass half-full vs. half-empty, I guess. If we assumed load
factor 1.0, then I see it as accounting for the load factor (while you see it
as not accounting for it).

Well, even before this we grew by factors of two, so 1 wasn't accurate
most of the time. My issue isn't really that I don't want to do it, but
that I'm not entirely sure that there's a good way. TupleHashEntryData
itself isn't exactly large, and we'd have to multiply it by the load
factor. At the same time, after the patch, AggStatePerGroupData isn't
allocated for empty elements anymore, and that's the largest part of the
memory. So I'm kind of inclined to just disregard the fillfactor (and
document that).

Also, with load factor 0.8 the table is only ~20% larger, so even if we
don't include that into work_mem it's a reasonably small error (easily
smaller than errors in the pre-9.5 HashJoin accounting, which did not
include chunk headers etc.).

I think in either cases the entries themselves are smaller after the
patch by enough that the overhead doesn't matter once you crossed a few
dozen entries.

Regards,

Andres


#13Arthur Silva
arthurprs@gmail.com
In reply to: Andres Freund (#4)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

Attached is a significantly improved version of this series. The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
addressing.
- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.

This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in [1], to improve performance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates

While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.

I think there's several possible additional users of simplehash.h,
e.g. catcache.c - which has an open coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.

Comments?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de


This is a great addition.

A couple of comments.

* 80% occupancy is a bit conservative for RH hashing; it works well up
to 90% if you use the early stops due to distance. So that TODO is worth
pursuing.

* An optimized memory layout for an RH hashmap is along the lines of
HHHKVKVKV, using one bit of the hash as the present/absent flag. That
plays especially well with large occupancies (~90%). Due to RH the
probings on the H array are very short and within a single cache line.
Even with a 31-bit hash it's really rare to have to probe more than 1
entry in the KV array, essentially guaranteeing that the 99th percentile
of lookups will only hit a couple of cache lines (if you ignore crossing
cache lines and other details).

#14Andres Freund
andres@anarazel.de
In reply to: Arthur Silva (#13)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

Hi,

On 2016-10-03 13:26:09 +0200, Arthur Silva wrote:

On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund <andres@anarazel.de> wrote:
A couple of comments.
* 80% occupancy is a bit conservative for RH hashing; it works well up to
90% if you use the early stops due to distance. So that TODO is worth
pursuing.

I found 90% a tiny bit slower during modifications, due to the increased
moving of cells around. I actually implemented that TODO at some point;
it's not actually a very clear win for narrow elements and mid-sized
tables - the additional instructions for the distance computations cost.

I've played with a different version of robin hood hashing, where the
buckets are ordered by hash value. I.e. the hash value is shifted right,
instead of being masked, to get the hash bucket. That allows having a
hashtable ordered by hash value, so distance checks simply become >=.
The problem with that is that it only works if you have an "overflow"
area at the end of the table, which is hard to size correctly.

* An optimized memory layout for an RH hashmap is along the lines of
HHHKVKVKV, using one bit of the hash as the present/absent flag. That plays
especially well with large occupancies (~90%). Due to RH the probings on the
H array are very short and within a single cache line. Even with a 31-bit
hash it's really rare to have to probe more than 1 entry in the KV array,
essentially guaranteeing that the 99th percentile of lookups will only hit a
couple of cache lines (if you ignore crossing cache lines and other
details).

That seems likely to end up being noticeably more complicated - I'm not
sure the benefits of the more complex lookups outweigh their cpu
overhead. I'd like to get the basics in; we can optimize further later
on. Based on my instrumentation the majority of time is currently spent
in the cache-miss for the initial bucket, everything else inside the
hash table code is quite small. After that it's hash value computations
in the execGrouping case.

Greetings,

Andres Freund


#15Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#11)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

Hi,

Attached is an updated version of the patchset. The main changes are
- address most of Tomas' feedback
- address regression test output changes by adding explicit ORDER BYs,
in a separate commit.
- fix issues around hash tables sized up to PG_UINT32_MAX
- fix a bug in iteration with "concurrent" deletions

We didn't really for dynahash (as it basically assumed a fillfactor of 1
all the time), not sure if changing it won't also cause issues.

That's a case of glass half-full vs. half-empty, I guess. If we assumed load
factor 1.0, then I see it as accounting for the load factor (while you see it
as not accounting for it).

Well, that load factor is almost never achieved, because we'd have grown
since... I added a comment about disregarding fill factor and growth
policy to estimate_hashagg_tablesize, which is actually the place where
it'd seemingly make sense to handle it.

Tomas, did you have time to run a benchmark?

I think this is starting to look good.

Regards,

Andres

Attachments:

0003-Add-a-macro-customizable-hashtable.patch (text/x-patch; +877 -5)
0001-Add-likely-unlikely-branch-hint-macros.patch (text/x-patch; +16 -1)
0002-Make-regression-tests-less-dependent-on-hash-table-o.patch (text/x-patch; +125 -110)
0004-Use-more-efficient-hashtable-for-tidbitmap.c-to-spee.patch (text/x-patch; +69 -61)
0005-Use-more-efficient-hashtable-for-execGrouping.c-to-s.patch (text/x-patch; +131 -185)
#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#15)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 10/10/2016 01:38 AM, Andres Freund wrote:

Hi,

Attached is an updated version of the patchset. The main changes are
- address most of Tomas' feedback
- address regression test output changes by adding explicit ORDER BYs,
in a separate commit.
- fix issues around hash tables sized up to PG_UINT32_MAX
- fix a bug in iteration with "concurrent" deletions

We didn't really for dynahash (as it basically assumed a fillfactor of 1
all the time), not sure if changing it won't also cause issues.

That's a case of glass half-full vs. half-empty, I guess. If we assumed load
factor 1.0, then I see it as accounting for the load factor (while you see it
as not accounting for it).

Well, that load factor is almost never achieved, because we'd have grown
since... I added a comment about disregarding fill factor and growth
policy to estimate_hashagg_tablesize, which is actually the place where
it'd seemingly make sense to handle it.

Tomas, did you have time to run a benchmark?

Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it
took quite a bit of time. These tests were done on a fairly small
machine (the usual i5-2500k with 8GB of RAM), with only 1GB data sets
for both benchmarks, to keep it completely in memory as I presume once
we start hitting I/O, it becomes the dominant part.

The machine was tuned a bit (shared_buffers=1GB, work_mem=256MB). There
was no parallelism enabled, and the tests were done first with a single
client and then with 4 clients running the queries for 4 hours.

I plan to run the tests on some larger machine (with more RAM, allowing
to use larger data sets while still keeping it in memory). But the
machine is busy doing something else, so it'll have to wait.

I didn't have time to do any sort of in-depth analysis (and I don't
expect to have the time in foreseeable future) - in particular I have
not looked at execution plans or anything like that. But let me show
some quick summary:

TPC-H (tpch.ods)
----------------

The results are either neutral or slightly positive - both in the case
of single stream (summary) and 4 parallel streams (summary-4-streams).
BTW I happened to run the single stream tests twice while debugging
some tooling issues, so that's why there are two sets of columns.

Each stream executed ~9200 queries in total, which means ~450 executions
for each query template. So, a lot of data, and the results look quite
stable (despite the query parameters being random).

There are a few queries that got a tiny bit (~3%) slower, but most of
those queries are very short anyway (0.1s) making them susceptible to
noise. On the other hand, there are a few queries that got ~10% faster,
which is nice. It's not something that would significantly change our
TPC-H scores, but not bad either.

TPC-DS (tpcds.ods)
------------------

In this case, I'd say the results are less convincing. There are quite a
few queries that got slower by ~10%, which is well above the noise - for
example queries 22 and 67. There are of course queries that got ~10%
faster, and in total the patched version executed more queries (so
overall the result is slightly positive, but not significantly).

regards
Tomas


Attachments:

tpcds.ods (application/vnd.oasis.opendocument.spreadsheet)
tpch.ods (application/vnd.oasis.opendocument.spreadsheet)
#17Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#16)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

Hi,

On 2016-10-11 02:38:26 +0200, Tomas Vondra wrote:

Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it took
quite a bit of time. These tests were done on a fairly small machine (the
usual i5-2500k with 8GB of RAM), with only 1GB data sets for both
benchmarks, to keep it completely in memory as I presume once we start
hitting I/O, it becomes the dominant part.

Thanks!

TPC-DS (tpcds.ods)
------------------

In this case, I'd say the results are less convincing. There are quite a few
queries that got slower by ~10%, which is well above - for example queries
22 and 67. There are of course queries that got ~10% faster, and in total
the patched version executed more queries (so overall the result is slightly
positive, but not significantly).

That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.

Greetings,

Andres Freund


#18Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 2016-10-10 17:46:22 -0700, Andres Freund wrote:

TPC-DS (tpcds.ods)
------------------

In this case, I'd say the results are less convincing. There are quite a few
queries that got slower by ~10%, which is well above - for example queries
22 and 67. There are of course queries that got ~10% faster, and in total
the patched version executed more queries (so overall the result is slightly
positive, but not significantly).

That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.

Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths. Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.

Greetings,

Andres Freund


#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#18)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 10/11/2016 04:07 AM, Andres Freund wrote:

On 2016-10-10 17:46:22 -0700, Andres Freund wrote:

TPC-DS (tpcds.ods)
------------------

In this case, I'd say the results are less convincing. There are quite a few
queries that got slower by ~10%, which is well above - for example queries
22 and 67. There are of course queries that got ~10% faster, and in total
the patched version executed more queries (so overall the result is slightly
positive, but not significantly).

That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.

Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths. Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.

Yes, I see similar plans (no bitmap index scans or hash aggregates). But
the difference is there, even when running the query alone (so it's not
merely due to the randomized ordering).

I wonder whether this is again due to compiler moving stuff around.

regards


#20Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#19)
Re: Macro customizable hashtable / bitmapscan & aggregation perf

On 2016-10-11 04:29:31 +0200, Tomas Vondra wrote:

On 10/11/2016 04:07 AM, Andres Freund wrote:

On 2016-10-10 17:46:22 -0700, Andres Freund wrote:

TPC-DS (tpcds.ods)
------------------

In this case, I'd say the results are less convincing. There are quite a few
queries that got slower by ~10%, which is well above - for example queries
22 and 67. There are of course queries that got ~10% faster, and in total
the patched version executed more queries (so overall the result is slightly
positive, but not significantly).

That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.

Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths. Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.

Yes, I see similar plans (no bitmap index scans or hash aggregates). But the
difference is there, even when running the query alone (so it's not merely
due to the randomized ordering).

I wonder whether this is again due to compiler moving stuff around.

It looks like that. I looked through a significant set of plans where
there were time differences (generated on my machine), and none of them
had bitmap or hash groupings to any significant degree. Comparing
profiles in those cases usually shows a picture like:
24.98% +0.32% postgres [.] slot_deform_tuple
16.58% -0.05% postgres [.] ExecMakeFunctionResultNoSets
12.41% -0.01% postgres [.] slot_getattr
6.10% +0.39% postgres [.] heap_getnext
4.41% -0.37% postgres [.] ExecQual
3.08% +0.12% postgres [.] ExecEvalScalarVarFast
2.85% -0.11% postgres [.] check_stack_depth
2.48% +0.42% postgres [.] ExecEvalConst
2.44% -0.33% postgres [.] heapgetpage
2.34% +0.11% postgres [.] ExecScan
2.14% -0.20% postgres [.] ExecStoreTuple

I.e. pretty random performance changes. This indeed looks like binary
layout changes. Looking at these plans and at profiles spanning a run
of all queries shows that bitmap scans and hash aggregations, while
present, account for a very small amount of time in total. So TPC-DS
doesn't look particularly interesting for evaluating these patches - but
very interesting for my slot deforming and qual evaluation patches.

Btw, query_14.sql as generated by your templates (in pgperffarm) doesn't
seem to work here. And I never had the patience to run query_1.sql to
completion... Looks like we could very well use some planner
improvements here.

Greetings,

Andres Freund


#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#20)
#22Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Andres Freund (#15)
#23Andres Freund
andres@anarazel.de
In reply to: Mark Dilger (#22)