Performance degradation in TPC-H Q18

Started by Kuntal Ghosh, about 9 years ago, 27 messages, pgsql-hackers
#1 Kuntal Ghosh
kuntalghosh.2007@gmail.com

Hello everyone,

While conducting experiments on parallelism, Rafia came across a
hang in Q18 when the plan uses partial and finalize hash aggregates. This
can be seen at both scale factors, 20 and 300, when work_mem is set
high enough that the query uses a hash aggregate. It seems that
commit b81b5a96f424531b97cdd1dba97d9d1b9c9d372e does not solve the
issue completely.

I've tested at scale factor 10 with work_mem set to 100MB, and the issue is
reproducible. Below is a simplified part of the query that uses the
hash aggregate.

tpch10GB=# explain select l_orderkey from lineitem group by l_orderkey;

QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=1612872.73..1616801.32 rows=392859 width=8)
Group Key: l_orderkey
-> Gather (cost=1528408.04..1610908.43 rows=785718 width=8)
Workers Planned: 2
-> Partial HashAggregate (cost=1527408.04..1531336.63
rows=392859 width=8)
Group Key: l_orderkey
-> Parallel Seq Scan on lineitem
(cost=0.00..1464922.43 rows=24994243 width=8)
(7 rows)

I've also executed the same query grouping on a different column.

tpch10GB=# explain analyze select l_suppkey from lineitem group by l_suppkey;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=1549511.80..1550493.37 rows=98157 width=8)
(actual time=22363.901..22380.361 rows=100000 loops=1)
Group Key: l_suppkey
-> Gather (cost=1528408.04..1549021.01 rows=196314 width=8)
(actual time=22107.387..22244.848 rows=300000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=1527408.04..1528389.61
rows=98157 width=8)
(actual time=22100.776..22135.063 rows=100000 loops=3)
Group Key: l_suppkey
-> Parallel Seq Scan on lineitem
(cost=0.00..1464922.43 rows=24994243 width=8)
(actual time=0.846..10664.258 rows=19995351 loops=3)
Planning time: 0.171 ms
Execution time: 22393.030 ms
(10 rows)

And it finished in time, although the stats in both plans are
exactly the same. Hence, the problem is specific to the first query. As
Robert suggested offline, I've produced a perf report for the query.

99.44% 0.01% postgres postgres [.] LookupTupleHashEntry
74.31% 53.55% postgres postgres [.] tuplehash_insert
17.05% 17.04% postgres libc-2.17.so [.] __memcpy_ssse3_back
11.06% 11.05% postgres postgres [.] tuplehash_prev
10.81% 10.80% postgres postgres [.] tuplehash_next

It's clear from the perf report that something terrible is going on in
the simplehash insert path. I've enabled logging of the hash stats
(SH_STAT()) during the hash-growing phase.

2017-02-21 16:42:39.634 IST [9372] LOG: size: 1048576, members:
838860, filled: 0.799999,
total chain: 2008236860, max chain: 66743, avg chain: 2394.007176,
total_collisions: 278348, max_collisions: 8, avg_collisions: 0.331817
2017-02-21 16:42:39.634 IST [9372] STATEMENT: explain analyze select
l_orderkey from lineitem group by l_orderkey;

2017-02-21 16:42:39.741 IST [9372] LOG: size: 2097152, members:
838860, filled: 0.400000,
total chain: 788175, max chain: 124, avg chain: 0.939579,
total_collisions: 216052, max_collisions: 6, avg_collisions: 0.257554
2017-02-21 16:42:39.741 IST [9372] STATEMENT: explain analyze select
l_orderkey from lineitem group by l_orderkey;

The max chain length is really high, which hampers the query's
performance. To understand the scenario further, I've collected the
following results:

tpch10GB=# select count(distinct l_orderkey) from lineitem;
count
----------
15000000

tpch10GB=# select correlation from pg_stats where tablename='lineitem'
and attname='l_orderkey';
correlation
-------------
1

tpch10GB=# select count(distinct l_suppkey) from lineitem;
count
--------
100000

tpch10GB=# select correlation from pg_stats where tablename='lineitem'
and attname='l_suppkey';
correlation
-------------
0.00900259

It seems that l_orderkey produces a pretty bad case for hashing with
linear probing and a bad choice of hash function (linear probing is
sensitive to elements with nearby hash codes). In [1], Andres
proposed a hack to resize the hashtable when a chain is growing
quickly. Below is a comment from the patch explaining the objective:

+            /*
+             * To avoid negative consequences from overly imbalanced
+             * hashtables, grow the hashtable if collisions lead to large
+             * runs. The most likely cause of such imbalance is filling a
+             * (currently) small table, from a currently big one, in
+             * hash-table order.
+             *
+             * FIXME: compute boundaries in a more principled manner.
+             */

After applying the patch, TPC-H Q18 and the simplified query both
complete in time. Although the patch works in this scenario, we need
a smarter way to trigger the hash expansion. The problem with the
patch is that it triggers a hash expansion even when the filled
percentage is pretty low, causing unnecessary memory consumption.
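As a rough illustration of the kind of heuristic being discussed (my own sketch, not the actual patch; `should_grow`, `MAX_RUN`, and the struct are made-up names, and the constant is arbitrary), the idea is to grow not only on fill factor but also when a single insertion walks an unusually long run of occupied buckets:

```c
#include <stdint.h>

/* Hypothetical threshold standing in for the patch's magic constants. */
#define MAX_RUN 64

typedef struct
{
    uint32_t size;              /* number of buckets */
    uint32_t members;           /* occupied buckets */
} hashtable;

/*
 * Grow on the normal fill-factor path (0.8 here), but also when one
 * insertion had to probe an unusually long run of occupied buckets,
 * which signals clustering even at a low filled percentage.
 */
static int
should_grow(const hashtable *tb, uint32_t run_length)
{
    if (tb->members >= tb->size * 8 / 10)
        return 1;               /* fill factor exceeded */
    if (run_length > MAX_RUN)
        return 1;               /* long run: clustering detected */
    return 0;
}
```

The memory-waste complaint above is exactly the second branch firing while the first condition is nowhere near true.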

Thoughts?

[1] http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz@alap3.anarazel.de
(0001-Resize-simplehash.h-table-in-case-of-long-runs.patch)

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Kuntal Ghosh (#1)
Re: Performance degradation in TPC-H Q18

On Wed, Feb 22, 2017 at 11:23 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

While conducting experiments on parallelism, Rafia came across a
hang in Q18 when the plan uses partial and finalize hash aggregates. This
can be seen at both scale factors, 20 and 300, when work_mem is set
high enough that the query uses a hash aggregate. It seems that
commit b81b5a96f424531b97cdd1dba97d9d1b9c9d372e does not solve the
issue completely.

Andres, any thoughts? Isn't this the same issue that we were discussing in
http://archives.postgresql.org/message-id/CA+TgmoYNO8qouPVO=1Q2aXzuxe942d_T5bcvZd9iKOC9tb3uLg@mail.gmail.com
over a month ago?

To me, it seems like a big problem that we have large, unfixed
performance regressions with simplehash four months after it went in.
I hate to suggest ripping the whole thing out, and it seems like
overkill, but it's pretty clear to me that the current state of things
is unacceptable, and that we're going to have a lot of unhappy users
if we ship it the way that it is right now. I want to point out that
the kinds of problems we're hitting here are exactly what I told you I
was worried about before it went in - that the average-case
performance would be better but that there would be
all-too-easy-to-hit cases where things got much worse because the
whole thing degenerated into a linear search. Not only did that
happen, but there seem to be multiple ways of producing it without
half trying, of which b81b5a96f424531b97cdd1dba97d9d1b9c9d372e fixed
only one. Something that's 5-10% faster in common cases but 2x or 10x
slower when things go wrong is not an improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: Performance degradation in TPC-H Q18

Hi,

On 2017-02-26 19:30:32 +0530, Robert Haas wrote:

On Wed, Feb 22, 2017 at 11:23 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:

While conducting experiments on parallelism, Rafia came across a
hang in Q18 when the plan uses partial and finalize hash aggregates. This
can be seen at both scale factors, 20 and 300, when work_mem is set
high enough that the query uses a hash aggregate. It seems that
commit b81b5a96f424531b97cdd1dba97d9d1b9c9d372e does not solve the
issue completely.

Andres, any thoughts? Isn't this the same issue that we were discussing in
http://archives.postgresql.org/message-id/CA+TgmoYNO8qouPVO=1Q2aXzuxe942d_T5bcvZd9iKOC9tb3uLg@mail.gmail.com
over a month ago?

Yes, I presume that it is.

To me, it seems like a big problem that we have large, unfixed
performance regressions with simplehash four months after it went in.

Yea, I agree. I'm fairly sure that the patch I posted in that thread
actually fixes the issue (and would also have made your already-applied
hash patch a small-ish optimization). I held back because I disliked
the idea of magic constants, and I couldn't figure out a way to
properly "derive" them - but I'm inclined to simply live with the
magic constants.

I hate to suggest ripping the whole thing out, and it seems like
overkill

Yea, I don't think we're there yet. Let me push the patch that resizes
adaptively, and we can see how things are going from there.

, but it's pretty clear to me that the current state of things
is unacceptable, and that we're going to have a lot of unhappy users
if we ship it the way that it is right now.

Yea, if we can't improve upon the current state, we'd need to revert.

I want to point out that
the kinds of problems we're hitting here are exactly what I told you I
was worried about before it went in - that the average-case
performance would be better but that there would be
all-too-easy-to-hit cases where things got much worse because the
whole thing degenerated into a linear search. Not only did that
happen, but there seem to be multiple ways of producing it without
half trying, of which b81b5a96f424531b97cdd1dba97d9d1b9c9d372e fixed
only one.

Note that that one is also fixed with what I'm proposing (but it's a
worthwhile small-ish improvement nonetheless).

Something that's 5-10% faster in common cases but 2x or 10x
slower when things go wrong is not an improvement.

The hash-table part is more like 3x faster - but as so often when making
things faster, the next bottleneck is just around the corner...

- Andres

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: Performance degradation in TPC-H Q18

On Tue, Feb 28, 2017 at 11:16 PM, Andres Freund <andres@anarazel.de> wrote:

To me, it seems like a big problem that we have large, unfixed
performance regressions with simplehash four months after it went in.

Yea, I agree. I'm fairly sure that the patch I posted in that thread
actually fixes the issue (and would also have made your already-applied
hash patch a small-ish optimization). I held back because I disliked
the idea of magic constants, and I couldn't figure out a way to
properly "derive" them - but I'm inclined to simply live with the
magic constants.

I think it's better to push in a proposed fix and see how it holds up
than to leave the tree in an unfixed state for long periods of time.
If the fix is good, then the fact that there's a magic constant
doesn't really matter all that much, and it can always be improved
later. On the other hand, if the fix is bad, pushing it improves the
chances of finding the problems. Not many people are going to test an
uncommitted fix.

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)
Re: Performance degradation in TPC-H Q18

On 2017-03-01 08:42:35 +0530, Robert Haas wrote:

On Tue, Feb 28, 2017 at 11:16 PM, Andres Freund <andres@anarazel.de> wrote:

To me, it seems like a big problem that we have large, unfixed
performance regressions with simplehash four months after it went in.

Yea, I agree. I'm fairly sure that the patch I posted in that thread
actually fixes the issue (and would also have made your already-applied
hash patch a small-ish optimization). I held back because I disliked
the idea of magic constants, and I couldn't figure out a way to
properly "derive" them - but I'm inclined to simply live with the
magic constants.

I think it's better to push in a proposed fix and see how it holds up
than to leave the tree in an unfixed state for long periods of time.

Yea, that was a mistake. I was kind of hoping for an epiphany that
has not yet come.

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

Yea, that'd be one approach, but I feel better dynamically increasing
the fillfactor like in the patch. Even with a lower fillfactor you could
see issues, and as you say a higher fillfactor is nice [TM]; after the
patch I played with *increasing* the fillfactor, without being able to
measure a downside.

Greetings,

Andres Freund

#6 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Andres Freund (#5)
Re: Performance degradation in TPC-H Q18

On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

Yea, that'd be one approach, but I feel better dynamically increasing
the fillfactor like in the patch. Even with a lower fillfactor you could
see issues, and as you say a higher fillfactor is nice [TM]; after the
patch I played with *increasing* the fillfactor, without being able to
measure a downside.

I've tested with different fill factors, and the query completed with a
fill factor of 0.6.

With linear probing, the performance degrades more quickly at high
fill factors because of primary clustering, a tendency for one
collision to cause more nearby collisions. So, increasing the fill
factor further doesn't seem to be a good idea. Besides, when I tested
with the patch, I saw the hash table grow even when only 10-12% of it
was filled.

LOG: size: 8388608, members: 845056, filled: 0.100739, total chain:
735723, max chain: 37, avg chain: 0.870620, total_collisions: 219101,
max_collisions: 6, avg_collisions: 0.259274
STATEMENT: explain analyze select l_orderkey from lineitem group by l_orderkey;
LOG: size: 16777216, members: 845056, filled: 0.050369, total chain:
197047, max chain: 6, avg chain: 0.233176, total_collisions: 121185,
max_collisions: 5, avg_collisions: 0.143405
STATEMENT: explain analyze select l_orderkey from lineitem group by l_orderkey;
LOG: size: 16777216, members: 2127649, filled: 0.126818, total chain:
1471327, max chain: 35, avg chain: 0.691527, total_collisions: 486310,
max_collisions: 6, avg_collisions: 0.228567
STATEMENT: explain analyze select l_orderkey from lineitem group by l_orderkey;
LOG: size: 33554432, members: 2127649, filled: 0.063409, total chain:
413073, max chain: 7, avg chain: 0.194145, total_collisions: 265766,
max_collisions: 5, avg_collisions: 0.124911
STATEMENT: explain analyze select l_orderkey from lineitem group by l_orderkey;

This seems bad. We're wasting a lot of memory in the hash table.

Apart from that, I was thinking about the following optimization in the code.
/*
* If the bucket is not empty, we either found a match (in which case
* we're done), or we have to decide whether to skip over or move the
* colliding entry. When the colliding element's distance to its
* optimal position is smaller than the to-be-inserted entry's, we
* shift the colliding entry (and its followers) forward by one.
*/
I'm worried about the fact that we are increasing the chain length of
each follower. It may happen that we've increased the chain length of
the last element by a huge factor.

So, I was looking for other alternatives, and I've found one called
Robin Hood hashing. In this approach, when you probe for a position to
insert a new element, if the probe length of the existing element is
less than the current probe length of the element being inserted, you
swap the two elements and keep going. It leads to a low variance in
chain lengths, rather than concentrating the displacement on the last
elements. Has this approach already been considered?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#7 Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#6)
Re: Performance degradation in TPC-H Q18

On 2017-03-01 09:13:15 +0530, Kuntal Ghosh wrote:

On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

Yea, that'd be one approach, but I feel better dynamically increasing
the fillfactor like in the patch. Even with a lower fillfactor you could
see issues, and as you say a higher fillfactor is nice [TM]; after the
patch I played with *increasing* the fillfactor, without being able to
measure a downside.

I've tested with different fill factors, and the query completed with a
fill factor of 0.6.

That's without the patch in
http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
? With that patch it should complete without that change, right?

With linear probing, the performance degrades more quickly at high
fill factors because of primary clustering, a tendency for one
collision to cause more nearby collisions. So, increasing the fill
factor further doesn't seem to be a good idea.

Well, the idea is increasing it, but *also* detecting clustering and
adaptively resizing earlier.

So, I was looking for other alternatives, and I've found one called
Robin Hood hashing.

simplehash.h implements robin hood hashing.

- Andres

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: Performance degradation in TPC-H Q18

On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

Yea, that'd be one approach, but I feel better dynamically increasing
the fillfactor like in the patch. Even with a lower fillfactor you could
see issues, and as you say a higher fillfactor is nice [TM]; after the
patch I played with *increasing* the fillfactor, without being able to
measure a downside.

I am going to bet that increasing the fillfactor would be a mistake.
The expected length of a clump in the table is going to be about
1/(1-ff), which increases very quickly beyond the current value of
0.8. For example, if the actual fillfactor is 0.9 and the keys are
perfectly distributed -- nine in a row and then one empty slot,
lather, rinse, repeat -- probing for a key that's not there has a 10%
chance of hitting an empty slot, a 10% chance of hitting the slot just
before an empty slot, and so on. So the expected number of probes to
find that there's no match is 4.5. However, it could easily happen
that you have 3 or 4 empty slots in a row in one place and 27 or 36
occupied slots in a row in another place, and that could cause all
kinds of grief. Moreover, real-world hash functions on real-world
data aren't always perfect, which increases the chances of things
going off track. I'm not exactly sure how high a fillfactor is too
high, but note that
https://en.wikipedia.org/wiki/Hash_table#Open_addressing claims that
"performance dramatically degrades when the load factor grows beyond
0.7 or so", which isn't cited but the same text appears in
https://www.unf.edu/~wkloster/3540/wiki_book2.pdf for whatever that's
worth.
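For what it's worth, the 1/(1-ff) clump estimate above is easy to tabulate (a toy calculation of my own, not PostgreSQL code; the fill factor is given in percent and the result is scaled by 100 to keep the arithmetic exact):

```c
/*
 * Expected clump length 1/(1 - ff) from the estimate above:
 * ff given in percent, result multiplied by 100.
 */
static int
expected_clump_x100(int ff_percent)
{
    return 10000 / (100 - ff_percent);
}
```

At the current fillfactor of 0.8 this gives an expected clump of 5 slots; at 0.9 it jumps to 10, which is why small increases in fill factor get expensive quickly.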

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#8)
Re: Performance degradation in TPC-H Q18

On 2017-03-01 09:21:47 +0530, Robert Haas wrote:

On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

Yea, that'd be one approach, but I feel better dynamically increasing
the fillfactor like in the patch. Even with a lower fillfactor you could
see issues, and as you say a higher fillfactor is nice [TM]; after the
patch I played with *increasing* the fillfactor, without being able to
measure a downside.

I am going to bet that increasing the fillfactor would be a mistake.
The expected length of a clump in the table is going to be about
1/(1-ff), which increases very quickly beyond the current value of
0.8. For example, if the actual fillfactor is 0.9 and the keys are
perfectly distributed -- nine in a row and then one empty slot,
lather, rinse, repeat -- probing for a key that's not there has a 10%
chance of hitting an empty slot, a 10% chance of hitting the slot just
before an empty slot, and so on. So the expected number of probes to
find that there's no match is 4.5. However, it could easily happen
that you have 3 or 4 empty slots in a row in one place and 27 or 36
occupied slots in a row in another place, and that could cause all
kinds of grief. Moreover, real-world hash functions on real-world
data aren't always perfect, which increases the chances of things
going off track. I'm not exactly sure how high a fillfactor is too
high, but note that
https://en.wikipedia.org/wiki/Hash_table#Open_addressing claims that
"performance dramatically degrades when the load factor grows beyond
0.7 or so", which isn't cited but the same text appears in
https://www.unf.edu/~wkloster/3540/wiki_book2.pdf for whatever that's
worth.

That's without the "trick" of "robin hood" hashing though:
https://en.wikipedia.org/wiki/Hash_table#Robin_Hood_hashing
http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
https://www.pvk.ca/Blog/2013/11/26/the-other-robin-hood-hashing/

How high you want the fillfactor probably depends a bit on the
scenario - for hash aggs a high fill factor would probably be good
(they're often "read" heavy), for bitmap scans less so.

But I guess there's not much point in immediately increasing the
fillfactor; we can do that in a second step.

Greetings,

Andres Freund

#10 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Andres Freund (#7)
Re: Performance degradation in TPC-H Q18

On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-01 09:13:15 +0530, Kuntal Ghosh wrote:

On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

Yea, that'd be one approach, but I feel better dynamically increasing
the fillfactor like in the patch. Even with a lower fillfactor you could
see issues, and as you say a higher fillfactor is nice [TM]; after the
patch I played with *increasing* the fillfactor, without being able to
measure a downside.

I've tested with different fill factors, and the query completed with a
fill factor of 0.6.

That's without the patch in
http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
? With that patch it should complete without that change, right?

No, it's with the patch in the above link, i.e.,
0001-Resize-simplehash.h-table-in-case-of-long-runs.patch. But I've
kept SH_FILLFACTOR at 0.8 and SH_MAX_FILLFACTOR at 0.95. I'll do
another round of testing with the constants provided by you.

With linear probing, the performance degrades more quickly at high
fill factors because of primary clustering, a tendency for one
collision to cause more nearby collisions. So, increasing the fill
factor further doesn't seem to be a good idea.

Well, the idea is increasing it, but *also* detecting clustering and
adaptively resizing earlier.

So, I was looking for other alternatives, and I've found one called
Robin Hood hashing.

simplehash.h implements robin hood hashing.

But, it doesn't implement the swapping idea, right?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#11 Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#10)
Re: Performance degradation in TPC-H Q18

Hi,

On 2017-03-01 09:33:07 +0530, Kuntal Ghosh wrote:

On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:

So, I was looking for other alternatives, and I've found one called
Robin Hood hashing.

simplehash.h implements robin hood hashing.

But, it doesn't implement the swapping idea, right?

It does, that's the if (insertdist > curdist) block in SH_INSERT.
Unless I misunderstand what you're proposing?

- Andres

#12 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Kuntal Ghosh (#10)
Re: Performance degradation in TPC-H Q18

On Wed, Mar 1, 2017 at 9:33 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:

That's without the patch in
http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
? With that patch it should complete without that change, right?

No, it's with the patch in the above link, i.e.,
0001-Resize-simplehash.h-table-in-case-of-long-runs.patch. But I've
kept SH_FILLFACTOR at 0.8 and SH_MAX_FILLFACTOR at 0.95. I'll do
another round of testing with the constants provided by you.

I've tested with the patch
0001-Resize-simplehash.h-table-in-case-of-long-runs.patch, and the
results are the same, i.e., the hash table grows even when it is only
10-12% filled. I've attached the logfile for reference.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

logfile (application/octet-stream)

#13 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Andres Freund (#11)
Re: Performance degradation in TPC-H Q18

On Wed, Mar 1, 2017 at 9:38 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2017-03-01 09:33:07 +0530, Kuntal Ghosh wrote:

On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:

So, I was looking for other alternatives, and I've found one called
Robin Hood hashing.

simplehash.h implements robin hood hashing.

But, it doesn't implement the swapping idea, right?

It does, that's the if (insertdist > curdist) block in SH_INSERT.
Unless I misunderstand what you're proposing?

I think the idea of the existing implementation is:

if (insertdist > curdist)
{
    find an empty slot at the end of the cluster;
    shift all the followers forward by one position to make room for
    the element to be inserted;
    insert the element at the current position;
}

What I'm trying to say is:

if (insertdist > curdist)
{
    swap the entry to be inserted with the current entry;
    try to insert the displaced entry using the same logic;
}

So, the second approach will not cause all the followers to be shifted
by one position.
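To make the comparison concrete, here is a toy, fixed-size sketch of the swap variant described above (my own illustration, not simplehash.h code; `rh_insert`, `EMPTY`, and the identity hash are made-up names for the example):

```c
#include <stdint.h>

#define TBLSIZE 8               /* toy power-of-two table */
#define EMPTY   UINT32_MAX

/* identity hash for illustration; a real table would mix the bits */
static uint32_t hashfn(uint32_t key) { return key; }

/* how far a key stored in 'slot' is from its optimal slot */
static uint32_t
distance(uint32_t key, uint32_t slot)
{
    return (slot - (hashfn(key) & (TBLSIZE - 1))) & (TBLSIZE - 1);
}

/*
 * Robin Hood insertion with swapping: when the incoming key is further
 * from home than the resident, the resident gives up its slot and
 * becomes the key being inserted.  Assumes the table never fills up.
 */
static void
rh_insert(uint32_t *tbl, uint32_t key)
{
    uint32_t slot = hashfn(key) & (TBLSIZE - 1);
    uint32_t dist = 0;

    for (;;)
    {
        if (tbl[slot] == EMPTY)
        {
            tbl[slot] = key;    /* free slot: done */
            return;
        }
        if (dist > distance(tbl[slot], slot))
        {
            uint32_t displaced = tbl[slot];

            tbl[slot] = key;    /* swap, keep inserting the loser */
            key = displaced;
            dist = distance(key, slot);
        }
        slot = (slot + 1) & (TBLSIZE - 1);
        dist++;
    }
}
```

Inserting 0, 1, then 8 (0 and 8 both hash to slot 0) ends with 8 displacing 1 from slot 1 and 1 moving one slot down, rather than a whole run being shifted in one pass.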

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#14 Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#13)
Re: Performance degradation in TPC-H Q18

On 2017-03-01 10:47:45 +0530, Kuntal Ghosh wrote:

if (insertdist > curdist)
{
    swap the entry to be inserted with the current entry;
    try to insert the displaced entry using the same logic;
}

So, the second approach will not cause all the followers to be shifted
by one position.

How not? You'll have to play that game until you find a free slot.

#15 Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Andres Freund (#14)
Re: Performance degradation in TPC-H Q18

On Wed, Mar 1, 2017 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-01 10:47:45 +0530, Kuntal Ghosh wrote:

if (insertdist > curdist)
{
    swap the entry to be inserted with the current entry;
    try to insert the displaced entry using the same logic;
}

So, the second approach will not cause all the followers to be shifted
by one position.

How not? You'll have to play that game until you find a free slot.

You can skip increasing the curdist of elements for which it is
already pretty high. Suppose we have the following hash table and an
element e:
_,_,_,i,_,_,j,j+1,j+2,_,_
We want to insert e at i, but there is a collision and we start
probing. At j, the condition (insertdist > curdist) is satisfied, and
we swap e with the element at j. Now we try to insert e (= the element
previously at j), and so on. In this process, if the elements at j+1
and j+2 already have a high distance from their optimal locations, you
can skip those elements and move ahead. But in the existing approach,
we increase their distance further. Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#16Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#12)
Re: Performance degradation in TPC-H Q18

Hi,

On 2017-03-01 10:39:11 +0530, Kuntal Ghosh wrote:

On Wed, Mar 1, 2017 at 9:33 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:

That's without the patch in
http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
? With that patch it should complete without that change, right?

No, it's with the patch in the above link, i.e.
0001-Resize-simplehash.h-table-in-case-of-long-runs.patch. But I've
kept SH_FILLFACTOR at 0.8 and SH_MAX_FILLFACTOR at 0.95. I'll do
another round of testing with the constants provided by you.

I've tested with the patch
0001-Resize-simplehash.h-table-in-case-of-long-runs.patch and the
results are the same, i.e., the hash table grows even when it is only
10-12% filled. I've attached the logfile for reference.

So, I observed similar things, where the *leader backend* reports a lot
of "long runs".

WARNING: clumping, growing this thing
LOG: size: 1048576, members: 158071, filled: 0.150748, total chain: 95622, max chain: 35, avg chain: 0.604931, total_collisions: 35909, max_collisions: 5, avg_collisions: 0.227170
STATEMENT: EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING: clumping, growing this thing
LOG: size: 2097152, members: 238992, filled: 0.113960, total chain: 1850141, max chain: 422, avg chain: 7.741435, total_collisions: 55363, max_collisions: 5, avg_collisions: 0.231652
STATEMENT: EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING: clumping, growing this thing
LOG: size: 4194304, members: 881089, filled: 0.210068, total chain: 567907, max chain: 29, avg chain: 0.644551, total_collisions: 174681, max_collisions: 6, avg_collisions: 0.198256
STATEMENT: EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING: clumping, growing this thing
LOG: size: 8388608, members: 5397649, filled: 0.643450, total chain: 5752690, max chain: 22, avg chain: 1.065777, total_collisions: 1437957, max_collisions: 8, avg_collisions: 0.266404
STATEMENT: EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;

The pertinent part is that there are "forced" resizes due to long runs
at fill factors of only 0.15, 0.11, 0.21 and 0.64. Note that there's
also a good number of "ordinary" resizes, and at the end there are
7500000 tuples in the hashtable. I.e. there are several resizes after
the last clumping-triggered one, all of them guided purely by the
fillfactor, and the final hash-table ends up entirely reasonably sized.

Trying to do the math, that just didn't sit well with me: there's just
no way there could be so many, and such long, runs without something
being broken. And indeed:

Thinking about this even more I realized the reason this happens is that
b81b5a96f4 didn't actually address the problem of inserting in
hash-table order. To understand, let's first look at the query plan:

Finalize HashAggregate (cost=899227.75..903176.55 rows=394880 width=4) (actual time=6255.957..7415.152 rows=7500000 loops=1)
Group Key: l_orderkey
-> Gather (cost=652427.75..893304.55 rows=2369280 width=4) (actual time=1741.706..3699.248 rows=7940881 loops=1)
Workers Planned: 6
Workers Launched: 6
-> Partial HashAggregate (cost=651427.75..655376.55 rows=394880 width=4) (actual time=1731.468..1956.693 rows=11344
Group Key: l_orderkey
-> Parallel Seq Scan on lineitem (cost=0.00..638927.80 rows=4999980 width=4) (actual time=0.025..711.771 rows
Planning time: 0.192 ms
Execution time: 7700.027 ms

The important part is the estimated rows=394880 and actual
rows=7500000. That means that the hash-table on the leader backend will
first be created suitably for 394880 rows. But we obviously get a lot
more.

The problem then is that even after introducing the hash_iv stuff
(b81b5a96f), we essentially still iterate over the tuples in
hash-order. TupleHashTableHash() computes the hash for a single-column
group condition as:

uint32 hashkey = hashtable->hash_iv;

hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
attr = slot_getattr(slot, att, &isNull);
if (!isNull)            /* treat nulls as having hash key 0 */
{
    uint32 hkey;

    hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i], attr));
    hashkey ^= hkey;
}

the resulting hash-values aren't actually meaningfully influenced by the
IV. Because we just xor with the IV, most hash-values that would have
fallen into a single hash-bucket without the IV still fall into a single
hash-bucket afterwards; just somewhere else in the hash-range.

Which then leads to the issue that we insert hundreds of rows into the
leader's hash-table that all fall into the same hash-bucket (as the
leader's hashtable is quite small at this point, the hash-buckets on the
worker tables have *far* fewer entries per bucket / are far smaller than
on the leader). Which then leads to super-long runs.

As soon as the table gets to its "real" size, the issue of the overlong
runs is gone, because the hash-table then has the appropriate size for
its contents.

So that explains why there are a lot of resizes when the fillfactor is
low (as you've observed, in the 10-20% range) - in specific hash-ranges
the fillfactor is extremely high, while other parts of the table are
effectively empty.

The question remains what to do about that... I think it's pretty clear
that we need a "defense" against such unbalanced hashtables - now that I
understand how we can get into the situation of these being so
unbalanced I'm a lot less uncomfortable adding the necessary logic.

In addition to that it seems quite worthwhile to provide an iterator
that's not vulnerable to this. An approach that I am, seemingly
successfully, testing is to iterate the hashtable in multiple (in my
case 23, because why not) passes, accessing only every nth element. That
allows the data to be inserted in a lot less "dense" fashion. But
that's more an optimization, so I'll just push something like the patch
mentioned in the thread already.

Makes some sense?

- Andres


#17Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#15)
Re: Performance degradation in TPC-H Q18

On 2017-03-01 11:05:33 +0530, Kuntal Ghosh wrote:

On Wed, Mar 1, 2017 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-01 10:47:45 +0530, Kuntal Ghosh wrote:

if (insertdist > curdist)
{
swap the entry to be inserted with the current entry.
Try to insert the current entry in the same logic.
}

So, the second approach will not cause all the followers to be shifted by 1.

How not? You'll have to play that game until you find a free slot.

You can skip increasing the curdist of some elements for which it is
already pretty high. Suppose we have the following hash table and an
element e,
_,_,_,i,_,_,j,j+1,j+2,_,_
We want to insert e at i, but there is a collision and we start
probing. At j, the condition (insertdist > curdist) is satisfied and
we swap e with the element at j. Now we try to insert e (= the element
formerly at j), and so on. In this process, if the elements at j+1, j+2
are already far from their optimal locations, we can skip over them.
But in the existing approach, we increase their distance further.
Thoughts?

All the following elements are already at their "optimal" position given
the previous occupation of the hash-table. As we only start to move
once the insert-distance of the existing element is lower than that of
the to-be-inserted one, the following elements will all be moved.
Doing the swap on an element-by-element basis would be possible, but it
just achieves the same result at a higher cost.

Regards,

Andres


#18Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#16)
Re: Performance degradation in TPC-H Q18

On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres@anarazel.de> wrote:

the resulting hash-values aren't actually meaningfully influenced by the
IV. Because we just xor with the IV, most hash-values that would have
fallen into a single hash-bucket without the IV still fall into a single
hash-bucket afterwards; just somewhere else in the hash-range.

Wow, OK. I had kind of assumed (without looking) that setting the
hash IV did something a little more useful than that. Maybe we should
do something like struct blah { int iv; int hv; }; newhv =
hash_any(&blah, sizeof(blah)).

In addition to that it seems quite worthwhile to provide an iterator
that's not vulnerable to this. An approach that I am, seemingly
successfully, testing is to iterate the hashtable in multiple (in my
case 23, because why not) passes, accessing only every nth element. That
allows the data to be inserted in a lot less "dense" fashion. But
that's more an optimization, so I'll just push something like the patch
mentioned in the thread already.

Makes some sense?

Yep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#18)
Re: Performance degradation in TPC-H Q18

On Fri, Mar 3, 2017 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres@anarazel.de> wrote:

the resulting hash-values aren't actually meaningfully influenced by the
IV. Because we just xor with the IV, most hash-values that would have
fallen into a single hash-bucket without the IV still fall into a single
hash-bucket afterwards; just somewhere else in the hash-range.

Wow, OK. I had kind of assumed (without looking) that setting the
hash IV did something a little more useful than that. Maybe we should
do something like struct blah { int iv; int hv; }; newhv =
hash_any(&blah, sizeof(blah)).

Sounds good. I've seen a post from Thomas Munro suggesting some
alternative approach for combining hash values in execGrouping.c[1].

In addition to that it seems quite worthwhile to provide an iterator
that's not vulnerable to this. An approach that I am, seemingly
successfully, testing is to iterate the hashtable in multiple (in my
case 23, because why not) passes, accessing only every nth element. That
allows the data to be inserted in a lot less "dense" fashion. But
that's more an optimization, so I'll just push something like the patch
mentioned in the thread already.

Makes some sense?

Yep.

Yes, it makes sense. Quadratic probing is another approach, but it
would require an extra shift op every time we want to find the next or
prev location during a collision.

[1]: /messages/by-id/CAEepm=3rdgjfxW4cKvJ0OEmya2-34B0qHNG1xV0vK7TGPJGMUQ@mail.gmail.com
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


#20Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#19)
Re: Performance degradation in TPC-H Q18

Hi,

attached is a patch to address this problem, and the one reported by
Dilip. I ran a lot of TPC-H and other benchmarks, and so far this
addresses all the performance issues, often being noticeably faster than
with the dynahash code.

Comments?

On 2017-03-03 11:23:00 +0530, Kuntal Ghosh wrote:

On Fri, Mar 3, 2017 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres@anarazel.de> wrote:

the resulting hash-values aren't actually meaningfully influenced by the
IV. Because we just xor with the IV, most hash-values that would have
fallen into a single hash-bucket without the IV still fall into a single
hash-bucket afterwards; just somewhere else in the hash-range.

Wow, OK. I had kind of assumed (without looking) that setting the
hash IV did something a little more useful than that. Maybe we should
do something like struct blah { int iv; int hv; }; newhv =
hash_any(&blah, sizeof(blah)).

The hash invocations are already noticeable performance-wise, so I'm a
bit hesitant to go there. I'd rather introduce a decent 'hash_combine'
function or such.

Sounds good. I've seen a post from Thomas Munro suggesting some
alternative approach for combining hash values in execGrouping.c[1].

Yea, I played with that too - it's a bit better, but doesn't really
address the problem fully either. Seems to generally reduce collisions
in multi-column keys a bit, but not that much.

In addition to that it seems quite worthwhile to provide an iterator
that's not vulnerable to this. An approach that I am, seemingly
successfully, testing is to iterate the hashtable in multiple (in my
case 23, because why not) passes, accessing only every nth element. That
allows the data to be inserted in a lot less "dense" fashion. But
that's more an optimization, so I'll just push something like the patch
mentioned in the thread already.

Makes some sense?

Yep.

Yes, it makes sense. Quadratic probing is another approach, but it
would require an extra shift op every time we want to find the next or
prev location during a collision.

That's not really the big problem with quadratic probing. The issue is
that it leads to further random unpredictable memory accesses in case of
collisions, whereas linear probing has easy to predict accesses.

Greetings,

Andres Freund

Attachments:

0001-Make-simplehash.h-grow-hashtable-in-additional-cases.patch (text/x-patch, +61/-6)
#21Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#20)
#22Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Robert Haas (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#21)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#26)