pgsql: Add parallel-aware hash joins.

Started by Andres Freund · over 8 years ago · 90 messages · hackers
#1 Andres Freund
andres@anarazel.de

Add parallel-aware hash joins.

Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.

After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.

The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:

* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs

One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.

A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.

A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.

Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
/messages/by-id/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
/messages/by-id/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/1804284042e659e7d16904e7bbb0ad546394b6a3

Modified Files
--------------
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/monitoring.sgml | 62 +-
src/backend/executor/execParallel.c | 21 +
src/backend/executor/execProcnode.c | 3 +
src/backend/executor/nodeHash.c | 1659 +++++++++++++++++++++++--
src/backend/executor/nodeHashjoin.c | 617 ++++++++-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 25 +-
src/backend/optimizer/path/joinpath.c | 36 +-
src/backend/optimizer/plan/createplan.c | 11 +
src/backend/optimizer/util/pathnode.c | 5 +-
src/backend/postmaster/pgstat.c | 45 +
src/backend/utils/misc/guc.c | 10 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/executor/hashjoin.h | 175 ++-
src/include/executor/nodeHash.h | 24 +-
src/include/executor/nodeHashjoin.h | 6 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 2 +
src/include/optimizer/cost.h | 4 +-
src/include/optimizer/pathnode.h | 1 +
src/include/pgstat.h | 15 +
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/join.out | 304 ++++-
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/join.sql | 148 ++-
src/tools/pgindent/typedefs.list | 4 +
30 files changed, 3091 insertions(+), 116 deletions(-)

#2 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: pgsql: Add parallel-aware hash joins.

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

but also one that's a lot more interesting:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08

which shows an assert failure:

#2 0x00000000008687d1 in ExceptionalCondition (conditionName=conditionName@entry=0xa76a98 "!(!accessor->sts->participants[i].writing)", errorType=errorType@entry=0x8b2c49 "FailedAssertion", fileName=fileName@entry=0xa76991 "sharedtuplestore.c", lineNumber=lineNumber@entry=273) at assert.c:54
#3 0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at sharedtuplestore.c:273
#4 0x0000000000634de4 in ExecParallelHashRepartitionRest (hashtable=0xfaec18) at nodeHash.c:1369
#5 ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
#6 0x000000000063546b in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0xfaec18, size=40, shared=shared@entry=0x7ffee26a8868) at nodeHash.c:2778
#7 0x00000000006357c8 in ExecParallelHashTableInsert (hashtable=hashtable@entry=0xfaec18, slot=slot@entry=0xfa76f8, hashvalue=<optimized out>) at nodeHash.c:1696
#8 0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at nodeHash.c:288
#9 MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112

which seems to suggest that something in the state machine logic is
borked. ExecParallelHashIncreaseNumBatches() should've ensured that
everyone has called sts_end_write()...

Greetings,

Andres Freund

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#2)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 21, 2017 at 10:29 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

Right, it looks like something takes more space on ppc systems causing
a batch increase that doesn't happen on amd64. I'll come back to
that.

but also one that's a lot more interesting:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08

which shows an assert failure:

[stack trace quoted in #2]

which seems to suggest that something in the state machine logic is
borked. ExecParallelHashIncreaseNumBatches() should've ensured that
everyone has called sts_end_write()...

Hmm. This looks the same as the one-off single assertion failure that
I mentioned[1] and had not been able to reproduce. Investigating.

[1]: /messages/by-id/CAEepm=0oE=yO0Kam86W1d-iJoasWByYkcrkDoJu6t5mRhFGHkQ@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: pgsql: Add parallel-aware hash joins.

On 2017-12-21 01:29:40 -0800, Andres Freund wrote:

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

but also one that's a lot more interesting:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08

which shows an assert failure:

[stack trace quoted in #2]

which seems to suggest that something in the state machine logic is
borked. ExecParallelHashIncreaseNumBatches() should've ensured that
everyone has called sts_end_write()...

Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
while other backends also access ->nbatch in
ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.

That'd lead to ExecParallelHashCloseBatchAccessors() likely not finishing
writing all batches (because nbatch < nbatch_old), which seems like it'd
explain this?

Greetings,

Andres Freund

#5 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: pgsql: Add parallel-aware hash joins.

On 2017-12-21 01:55:50 -0800, Andres Freund wrote:

On 2017-12-21 01:29:40 -0800, Andres Freund wrote:

[nested quotes and stack trace trimmed; see #2]

Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
while other backends also access ->nbatch in
ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.

That'd lead to ExecParallelHashCloseBatchAccessors() likely not finishing
writing all batches (because nbatch < nbatch_old), which seems like it'd
explain this?

Trying to debug this I found another issue. I'd placed a sleep(10) in
ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for
some reason. Segfault time:

#0 0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at /home/andres/src/postgresql/src/include/storage/s_lock.h:228
#1 0x000055bfbac42b4d in ConditionVariableCancelSleep () at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173
#2 0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478
#3 0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:4387
#4 0x000055bfba91ed97 in RemoveTempRelationsCallback (code=1, arg=0) at /home/andres/src/postgresql/src/backend/catalog/namespace.c:4034
#5 0x000055bfbac1bc90 in shmem_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:228
#6 0x000055bfbac1bb67 in proc_exit_prepare (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:185
#7 0x000055bfbac1bacf in proc_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:102
#8 0x000055bfbadbccf0 in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:543
#9 0x000055bfbac4eda3 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:2917
#10 0x000055bfbac42a63 in ConditionVariableSleep (cv=0x7fcd82ae14ac, wait_event_info=134217742) at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:129
#11 0x000055bfbac18405 in BarrierArriveAndWait (barrier=0x7fcd82ae1494, wait_event_info=134217742) at /home/andres/src/postgresql/src/backend/storage/ipc/barrier.c:191
#12 0x000055bfbaa9361e in ExecParallelHashIncreaseNumBatches (hashtable=0x55bfbd0e11d0) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:1191
#13 0x000055bfbaa962ef in ExecParallelHashTupleAlloc (hashtable=0x55bfbd0e11d0, size=40, shared=0x7ffda8967050) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:2781
#14 0x000055bfbaa946e8 in ExecParallelHashTableInsert (hashtable=0x55bfbd0e11d0, slot=0x55bfbd089a80, hashvalue=3825063138) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:1699
#15 0x000055bfbaa91d90 in MultiExecParallelHash (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:288
#16 0x000055bfbaa919b9 in MultiExecHash (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:112
#17 0x000055bfbaa7a500 in MultiExecProcNode (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:502
#18 0x000055bfbaa98515 in ExecHashJoinImpl (parallel=1 '\001', pstate=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:291
#19 ExecParallelHashJoin (pstate=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:582
#20 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#21 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053d50) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#22 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#23 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#24 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#25 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#26 0x000055bfbaa90a4e in ExecProcNode (node=0x55bfbd053698) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#27 0x000055bfbaa910d5 in gather_getnext (gatherstate=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:285
#28 0x000055bfbaa90f5f in ExecGather (pstate=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:216
#29 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#30 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053340) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#31 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#32 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#33 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#34 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#35 0x000055bfbaa716c6 in ExecProcNode (node=0x55bfbd052c18) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#36 0x000055bfbaa7404e in ExecutePlan (estate=0x55bfbd0529c8, planstate=0x55bfbd052c18, use_parallel_mode=1 '\001', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x55bfbd121ae0, execute_once=1 '\001')
at /home/andres/src/postgresql/src/backend/executor/execMain.c:1718
#37 0x000055bfbaa71cba in standard_ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:361
#38 0x000055bfbaa71ad4 in ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:304
#39 0x000055bfbac52725 in PortalRunSelect (portal=0x55bfbcf56a48, forward=1 '\001', count=0, dest=0x55bfbd121ae0) at /home/andres/src/postgresql/src/backend/tcop/pquery.c:932
#40 0x000055bfbac523b8 in PortalRun (portal=0x55bfbcf56a48, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x55bfbd121ae0, altdest=0x55bfbd121ae0, completionTag=0x7ffda8967840 "") at /home/andres/src/postgresql/src/backend/tcop/pquery.c:773
#41 0x000055bfbac4c0a1 in exec_simple_query (query_string=0x55bfbceefaf8 "select count(*) from simple r join bigger_than_it_looks s using (id);") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1120
#42 0x000055bfbac505e4 in PostgresMain (argc=1, argv=0x55bfbcf1d178, dbname=0x55bfbcf1cf30 "regression", username=0x55bfbceec588 "andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4139
#43 0x000055bfbabac375 in BackendRun (port=0x55bfbcf120f0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4412
#44 0x000055bfbababa74 in BackendStartup (port=0x55bfbcf120f0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4084
#45 0x000055bfbaba7d49 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1757
#46 0x000055bfbaba72d2 in PostmasterMain (argc=39, argv=0x55bfbcee9e90) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1365
#47 0x000055bfbaadb49d in main (argc=39, argv=0x55bfbcee9e90) at /home/andres/src/postgresql/src/backend/main/main.c:228

So, afaics no workers had yet attached, the leader accepted the cancel
interrupt, the dsm segments were destroyed, and as part of cleanup
cv_sleep_target was supposed to be reset, which fails because its
memory has since been freed. Looking at how that can happen.

Greetings,

Andres Freund

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
condition variable cleanup and subtransactions

Hi,

On 2017-12-21 02:42:25 -0800, Andres Freund wrote:

Trying to debug this I found another issue. I'd placed a sleep(10) in
ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for
some reason. Segfault time:

#0 0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at /home/andres/src/postgresql/src/include/storage/s_lock.h:228
#1 0x000055bfbac42b4d in ConditionVariableCancelSleep () at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173
#2 0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478
#3 0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:4387

So, afaics no workers had yet attached, the leader accepted the cancel
interrupt, the dsm segments were destroyed, and as part of cleanup
cv_sleep_target was supposed to be reset, which fails because its
memory has since been freed. Looking at how that can happen.

Oh. This seems to be a condition variable bug independent of PHJ. The
problem is that the DSM segment etc all get cleaned up in
*subtransaction* abort.

Afaict it's a bug that AbortTransaction() does
ConditionVariableCancelSleep() but AbortSubTransaction() does not,
despite the latter releasing dsm segments via
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS).

Adding that seems to fix the crash.

This seems like something we need to backpatch.

Greetings,

Andres Freund

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#6)
Re: condition variable cleanup and subtransactions

On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:

Oh. This seems to be a condition variable bug independent of PHJ. The
problem is that the DSM segment etc all get cleaned up in
*subtransaction* abort.

Afaict it's a bug that AbortTransaction() does
ConditionVariableCancelSleep() but AbortSubTransaction() does not,
despite the latter releasing dsm segments via
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS).

Adding that seems to fix the crash.

This seems like something we need to backpatch.

Agreed. That affects any user of condition variables inside DSM
segments, including the released Parallel Index Scan and Parallel
Bitmap Heap Scan code.

--
Thomas Munro
http://www.enterprisedb.com

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#4)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 21, 2017 at 10:55 PM, Andres Freund <andres@anarazel.de> wrote:

Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
while other backends also access ->nbatch in
ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.

That'd lead to ExecParallelHashCloseBatchAccessors() likely not finishing
writing all batches (because nbatch < nbatch_old), which seems like it'd
explain this?

I don't think that's quite it, because it should never have set
'writing' for any batch number >= nbatch.

It's late here, but I'll take this up tomorrow and either find a fix
or figure out how to avoid antisocial noise levels on the build farm
in the meantime.

--
Thomas Munro
http://www.enterprisedb.com

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#7)
Re: condition variable cleanup and subtransactions

On Thu, Dec 21, 2017 at 6:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:

Oh. This seems to be a condition variable bug independent of PHJ. The
problem is that the DSM segment etc all get cleaned up in
*subtransaction* abort.

Afaict it's a bug that AbortTransaction() does
ConditionVariableCancelSleep() but AbortSubTransaction() does not,
despite the latter releasing dsm segments via
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS).

Adding that seems to fix the crash.

This seems like something we need to backpatch.

Agreed. That affects any user of condition variables inside DSM
segments, including the released Parallel Index Scan and Parallel
Bitmap Heap Scan code.

Fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund <andres@anarazel.de> writes:

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

[ as well as various assertion crashes ]

I see my machine prairiedog is showing yet another variation, possibly
related to the tern/termite failure, but not the same:

================= pgsql.build/src/test/regress/regression.diffs ===================
*** /Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/join.out	Thu Dec 21 03:55:15 2017
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/join.out	Thu Dec 21 10:56:00 2017
***************
*** 6682,6688 ****
  $$);
   multibatch 
  ------------
!  t
  (1 row)
  rollback to settings;
--- 6682,6688 ----
  $$);
   multibatch 
  ------------
!  f
  (1 row)

rollback to settings;

======================================================================

Let me know if you need me to poke into that.

regards, tom lane

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#8)
Re: pgsql: Add parallel-aware hash joins.

On Fri, Dec 22, 2017 at 1:48 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I don't think that's quite it, because it should never have set
'writing' for any batch number >= nbatch.

It's late here, but I'll take this up tomorrow and either find a fix
or figure out how to avoid antisocial noise levels on the build farm
in the meantime.

Not there yet but I learned some things and am still working on it. I
spent a lot of time trying to reproduce the assertion failure, and
succeeded exactly once. Unfortunately, the one time I managed to do
that I'd built with clang -O2 and got a core file that I couldn't get
much useful info out of, and I've been trying to do it again with -O0
ever since without luck. The time I succeeded, I reproduced it by
creating the tables "simple" and "bigger_than_it_looks" from join.sql
and then doing this in a loop:

set min_parallel_table_scan_size = 0;
set parallel_setup_cost = 0;
set work_mem = '192kB';

explain analyze select count(*) from simple r join
bigger_than_it_looks s using (id);

The machine that it happened on is resource constrained, and exhibits
another problem: though the above query normally runs in ~20ms,
sometimes it takes several seconds and occasionally much longer. That
never happens on fast development systems or test servers which run it
quickly every time, and it doesn't happen on my 2 core slow system if
I have only two workers (or one worker + leader). I dug into that and
figured out what was going wrong and wrote that up separately[1],
because I think it's an independent bug needing to be fixed, not the
root cause here. However, I think it could easily be contributing to
the timing required to trigger the bug we're looking for.

Andres, your machine francolin crashed -- got a core file?

[1]: /messages/by-id/CAEepm=0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

#12 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#11)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2017-12-22 21:16:10 +1300, Thomas Munro wrote:

Andres, your machine francolin crashed -- got a core file?

Unfortunately not - it appears the buildfarm cleared it away :(

Might try to reproduce it on that machine...

Greetings,

Andres Freund

#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#12)
Re: pgsql: Add parallel-aware hash joins.

On Fri, Dec 22, 2017 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-12-22 21:16:10 +1300, Thomas Munro wrote:

Andres, your machine francolin crashed -- got a core file?

Unfortunately not - it appears the buildfarm cleared it away :(

I now have a workload that fails within a few minutes or so on master.
I see the problem: MultiExecParallelHash() needs to run
sts_end_write() *before* detaching from grow_batches_barrier. My test
case doesn't fail with the attached patch applied.

I'll address the instability of the regression test output separately.
Sorry for the delay, due to the time of year.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-phj-writing-assertion-failure.patch (application/octet-stream, +3/-3)

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#13)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I'll address the instability of the regression test output separately.

If you're still looking for data on that --- prairiedog is able to
reproduce the "multibatch = f" variant about one time in thirty.
I modified the test case to print out the full EXPLAIN ANALYZE output
rather than a heavily filtered version. Here's a typical successful run:

! explain analyze
select length(max(s.t))
from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
! QUERY PLAN
! ------------------------------------------------------------------------------------------------------------------------------------------------
! Finalize Aggregate (cost=97.49..97.51 rows=1 width=4) (actual time=409.330..409.331 rows=1 loops=1)
! -> Gather (cost=97.28..97.49 rows=2 width=32) (actual time=376.957..395.841 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=97.28..97.29 rows=1 width=32) (actual time=254.370..254.373 rows=1 loops=3)
! -> Parallel Hash Left Join (cost=23.23..88.88 rows=3360 width=32) (actual time=240.812..241.297 rows=1 loops=3)
! Hash Cond: (wide.id = wide_1.id)
! -> Parallel Seq Scan on wide (cost=0.00..15.29 rows=529 width=4) (actual time=0.066..0.075 rows=1 loops=3)
! -> Parallel Hash (cost=16.61..16.61 rows=529 width=36) (actual time=109.565..109.565 rows=1 loops=3)
! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! -> Parallel Seq Scan on wide wide_1 (cost=0.00..16.61 rows=529 width=36) (actual time=2.989..7.218 rows=1 loops=3)
! Planning time: 1.317 ms
! Execution time: 424.395 ms
! (13 rows)

and here's a failing run:

! explain analyze
select length(max(s.t))
from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
! QUERY PLAN
! -------------------------------------------------------------------------------------------------------------------------------------------------
! Finalize Aggregate (cost=97.49..97.51 rows=1 width=4) (actual time=232.440..232.441 rows=1 loops=1)
! -> Gather (cost=97.28..97.49 rows=2 width=32) (actual time=225.738..226.744 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=97.28..97.29 rows=1 width=32) (actual time=29.377..29.379 rows=1 loops=3)
! -> Parallel Hash Left Join (cost=23.23..88.88 rows=3360 width=32) (actual time=22.747..25.340 rows=1 loops=3)
! Hash Cond: (wide.id = wide_1.id)
! -> Parallel Seq Scan on wide (cost=0.00..15.29 rows=529 width=4) (actual time=0.086..0.113 rows=2 loops=1)
! -> Parallel Hash (cost=16.61..16.61 rows=529 width=36) (actual time=16.382..16.382 rows=1 loops=3)
! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! -> Parallel Seq Scan on wide wide_1 (cost=0.00..16.61 rows=529 width=36) (actual time=9.167..21.301 rows=2 loops=1)
! Planning time: 1.289 ms
! Execution time: 243.120 ms
! (13 rows)

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker. I also find
it most curious that the "success" case is a lot slower than the "not
success" case. Perhaps this is related to your livelock issue? Doing
another run, I get something even slower:

! explain analyze
select length(max(s.t))
from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
! QUERY PLAN
! -------------------------------------------------------------------------------------------------------------------------------------------------
! Finalize Aggregate (cost=97.49..97.51 rows=1 width=4) (actual time=487.245..487.246 rows=1 loops=1)
! -> Gather (cost=97.28..97.49 rows=2 width=32) (actual time=444.650..475.390 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=97.28..97.29 rows=1 width=32) (actual time=345.816..345.819 rows=1 loops=3)
! -> Parallel Hash Left Join (cost=23.23..88.88 rows=3360 width=32) (actual time=334.229..338.098 rows=1 loops=3)
! Hash Cond: (wide.id = wide_1.id)
! -> Parallel Seq Scan on wide (cost=0.00..15.29 rows=529 width=4) (actual time=0.065..0.074 rows=1 loops=3)
! -> Parallel Hash (cost=16.61..16.61 rows=529 width=36) (actual time=140.210..140.210 rows=1 loops=3)
! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! -> Parallel Seq Scan on wide wide_1 (cost=0.00..16.61 rows=529 width=36) (actual time=4.233..15.117 rows=1 loops=3)
! Planning time: 1.380 ms
! Execution time: 509.607 ms
! (13 rows)

Aside from the instability problems, I'm pretty unhappy about how much
the PHJ patch has added to the runtime of "make check". I do not think
any one feature can justify adding 20% to that. Can't you cut down the
amount of data processed by these new test cases?

regards, tom lane

#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#14)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I'll address the instability of the regression test output separately.

If you're still looking for data on that --- prairiedog is able to
reproduce the "multibatch = f" variant about one time in thirty.
I modified the test case to print out the full EXPLAIN ANALYZE output
rather than a heavily filtered version. Here's a typical successful run:

! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! Execution time: 424.395 ms

and here's a failing run:

! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! Execution time: 243.120 ms

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker.

Hmm. Yeah, that seems quite likely -- thanks. Investigating now.

I also find
it most curious that the "success" case is a lot slower than the "not
success" case. Perhaps this is related to your livelock issue? Doing
another run, I get something even slower:

! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! Execution time: 509.607 ms

Yes, that looks a lot like the ConditionVariableBroadcast() livelock
problem. That query with the same 8-batch 2-worker plan runs in ~12ms
and ~19ms for me on two different machines here without the run-to-run
variation you see.

Aside from the instability problems, I'm pretty unhappy about how much
the PHJ patch has added to the runtime of "make check". I do not think
any one feature can justify adding 20% to that. Can't you cut down the
amount of data processed by these new test cases?

Isn't that mostly because of the CV livelock problem? Here are the
real times I got with "time make check" on my slow 2 core Intel(R)
Celeron(R) CPU G1610T @ 2.30GHz running FreeBSD, which seems to be
prone to that problem on certain queries:

84940644 = 34.83s (before hash join regression tests)
fa330f9a = 35.81s (hash join regression tests)
18042840 = 44.92s (parallel-aware hash joins + new tests)

So the PHJ patch made it take 25% longer, similar to what you
reported. But then if I apply the following band-aid kludge to
condition_variable.c to limit the looping, I get:

-       while (ConditionVariableSignal(cv))
+       while (nwoken < max_parallel_workers && ConditionVariableSignal(cv))
                ++nwoken;

18042840 + kludge = 36.70s

So without the effects of that bug it's only taking 2.4% longer than
commit fa330f9a. Is that acceptable for a feature of this size and
complexity? I will also look into making the data sets smaller.

--
Thomas Munro
http://www.enterprisedb.com

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#15)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aside from the instability problems, I'm pretty unhappy about how much
the PHJ patch has added to the runtime of "make check". I do not think
any one feature can justify adding 20% to that. Can't you cut down the
amount of data processed by these new test cases?

Isn't that mostly because of the CV livelock problem?

Hm, maybe. I quoted the 20% figure on the basis of longfin's reports,
not prairiedog's ... but it might be seeing some of the livelock effect
too.

So without the effects of that bug it's only taking 2.4% longer than
commit fa330f9a. Is that acceptable for a feature of this size and
complexity? I will also look into making the data sets smaller.

That sounds better, but it's still worth asking whether the tests
could be quicker.

regards, tom lane

#17 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#15)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 28, 2017 at 5:15 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! Execution time: 243.120 ms

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker.

Hmm. Yeah, that seems quite likely -- thanks. Investigating now.

This is explained by the early exit case in
ExecParallelHashEnsureBatchAccessors(). With just the right timing,
it finishes up not reporting the true nbatch number, and never calling
ExecParallelHashUpdateSpacePeak().

In my patch for commit 5bcf389e (before PHJ), I had extracted and
rejiggered some parts of my PHJ work to fix a problem with EXPLAIN for
parallel-oblivious hash joins running in parallel queries, but I
failed to readapt it properly for PHJ. EXPLAIN needs to scan all
participants' HashInstrumentation to collect the greatest space
report, not just the first one it finds. I'll test and post a patch
to fix this tomorrow.

--
Thomas Munro
http://www.enterprisedb.com

#18 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#17)
Re: pgsql: Add parallel-aware hash joins.

On Fri, Dec 29, 2017 at 2:21 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 28, 2017 at 5:15 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! Execution time: 243.120 ms

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker.

Hmm. Yeah, that seems quite likely -- thanks. Investigating now.

This is explained by the early exit case in
ExecParallelHashEnsureBatchAccessors(). With just the right timing,
it finishes up not reporting the true nbatch number, and never calling
ExecParallelHashUpdateSpacePeak().

Hi Tom,

You mentioned that prairiedog sees the problem about one time in
thirty. Would you mind checking if it goes away with this patch
applied?

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-phj-explain.patch (application/octet-stream, +60 -51)
#19 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#18)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2017-12-31 02:51:26 +1300, Thomas Munro wrote:

You mentioned that prairiedog sees the problem about one time in
thirty. Would you mind checking if it goes away with this patch
applied?

--
Thomas Munro
http://www.enterprisedb.com

From cbed027275039cc5debf8db89342a133a831c9ca Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sun, 31 Dec 2017 02:03:07 +1300
Subject: [PATCH] Fix EXPLAIN ANALYZE output for Parallel Hash.

In a race case, EXPLAIN ANALYZE could fail to display correct nbatch and size
information. Refactor so that participants report only on batches they worked
on rather than trying to report on all of them, and teach explain.c to
consider the HashInstrumentation object from all participants instead of
picking the first one it can find. This should fix an occasional build farm
failure in the "join" regression test.

This seems buggy independent of whether it fixes the issue on prairiedog,
right? So I'm inclined to go ahead and just fix it...

+	/*
+	 * Merge results from workers.  In the parallel-oblivious case, the
+	 * results from all participants should be identical, except where
+	 * participants didn't run the join at all so have no data.  In the
+	 * parallel-aware case, we need to aggregate the results.  Each worker may
+	 * have seen a different subset of batches and we want to report the peak
+	 * memory usage across all batches.
+	 */

It's not necessarily the peak though, right? The largest batches might
not be read in at the same time. I'm fine with approximating it as such,
just want to make sure I understand.

+	if (hashstate->shared_info)
 	{
 		SharedHashInfo *shared_info = hashstate->shared_info;
 		int			i;
 
-		/* Find the first worker that built a hash table. */
 		for (i = 0; i < shared_info->num_workers; ++i)
 		{
-			if (shared_info->hinstrument[i].nbatch > 0)
+			HashInstrumentation *worker_hi = &shared_info->hinstrument[i];
+
+			if (worker_hi->nbatch > 0)
 			{
-				hinstrument = &shared_info->hinstrument[i];
-				break;
+				/*
+				 * Every participant should agree on the buckets, so to be
+				 * sure we have a value we'll just overwrite each time.
+				 */
+				hinstrument.nbuckets = worker_hi->nbuckets;
+				hinstrument.nbuckets_original = worker_hi->nbuckets_original;
+				/*
+				 * Normally every participant should agree on the number of
+				 * batches too, but it's possible for a backend that started
+				 * late and missed the whole join not to have the final nbatch
+				 * number.  So we'll take the largest number.
+				 */
+				hinstrument.nbatch = Max(hinstrument.nbatch, worker_hi->nbatch);
+				hinstrument.nbatch_original = worker_hi->nbatch_original;
+				/*
+				 * In a parallel-aware hash join, for now we report the
+				 * maximum peak memory reported by any worker.
+				 */
+				hinstrument.space_peak =
+					Max(hinstrument.space_peak, worker_hi->space_peak);

I bet pgindent will not like this layout.

Ho hum. Is this really, as you say above, an "aggregate the results"?

Greetings,

Andres Freund

#20 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#19)
Re: pgsql: Add parallel-aware hash joins.

On Sun, Dec 31, 2017 at 5:16 AM, Andres Freund <andres@anarazel.de> wrote:

In a race case, EXPLAIN ANALYZE could fail to display correct nbatch and size
information. Refactor so that participants report only on batches they worked
on rather than trying to report on all of them, and teach explain.c to
consider the HashInstrumentation object from all participants instead of
picking the first one it can find. This should fix an occasional build farm
failure in the "join" regression test.

This seems buggy independent of whether it fixes the issue on prairiedog,
right? So I'm inclined to go ahead and just fix it...

+1

+     /*
+      * Merge results from workers.  In the parallel-oblivious case, the
+      * results from all participants should be identical, except where
+      * participants didn't run the join at all so have no data.  In the
+      * parallel-aware case, we need to aggregate the results.  Each worker may
+      * have seen a different subset of batches and we want to report the peak
+      * memory usage across all batches.
+      */

It's not necessarily the peak though, right? The largest batches might
not be read in at the same time. I'm fine with approximating it as such,
just want to make sure I understand.

Yeah, it's not attempting to report the true simultaneous peak memory
usage. It's only reporting the largest individual hash table ever
loaded. In a multi-batch join more than one hash table may be loaded
at the same time -- up to the number of participants -- but I'm not
yet attempting to reflect that. On the one hand, that's a bit like
the way we show the size for parallel-oblivious hash joins: each
participant used the reported amount of memory at approximately the
same time. On the other hand, the total simultaneous memory usage for
parallel-aware hash join is capped by both nbatch and nparticipants:
the true simultaneous peak must be <= largest_hash_table * Min(nbatch,
nparticipants). I considered various ways to capture and report this:

* show per-worker information separately (the 0007 patch in the v26
patchset, which I abandoned because it was useless and confusing);

* report the sum of the nparticipants largest hash tables, or just
assume all batches are similarly sized and use the formula I gave
above;

* actually track which hash tables or memory regions were
simultaneously loaded, with an incremental shared counter maintained
when hash chunks and bucket arrays are allocated and freed.

But I figured we should just go with something super simple for now
and then discuss better ideas as a later evolution.

[code]

I bet pgindent will not like this layout.

pgindented.

Ho hum. Is this really, as you say above, an "aggregate the results"?

Yeah, misleading/stupid use of "aggregate" (SQL MAX() is an
aggregate...). Offending word removed.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-phj-explain-v2.patch (application/octet-stream, +62 -52)