pgsql: Add parallel-aware hash joins.

Started by Andres Freund · about 8 years ago · 90 messages
#1 Andres Freund
andres@anarazel.de

Add parallel-aware hash joins.

Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.

After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.

The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:

* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs

One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.

A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.

A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.

Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
/messages/by-id/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
/messages/by-id/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/1804284042e659e7d16904e7bbb0ad546394b6a3

Modified Files
--------------
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/monitoring.sgml | 62 +-
src/backend/executor/execParallel.c | 21 +
src/backend/executor/execProcnode.c | 3 +
src/backend/executor/nodeHash.c | 1659 +++++++++++++++++++++++--
src/backend/executor/nodeHashjoin.c | 617 ++++++++-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 25 +-
src/backend/optimizer/path/joinpath.c | 36 +-
src/backend/optimizer/plan/createplan.c | 11 +
src/backend/optimizer/util/pathnode.c | 5 +-
src/backend/postmaster/pgstat.c | 45 +
src/backend/utils/misc/guc.c | 10 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/executor/hashjoin.h | 175 ++-
src/include/executor/nodeHash.h | 24 +-
src/include/executor/nodeHashjoin.h | 6 +
src/include/nodes/execnodes.h | 6 +
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 2 +
src/include/optimizer/cost.h | 4 +-
src/include/optimizer/pathnode.h | 1 +
src/include/pgstat.h | 15 +
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/join.out | 304 ++++-
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/join.sql | 148 ++-
src/tools/pgindent/typedefs.list | 4 +
30 files changed, 3091 insertions(+), 116 deletions(-)

#2 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: pgsql: Add parallel-aware hash joins.

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

but also one that's a lot more interesting:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08

which shows an assert failure:

#2 0x00000000008687d1 in ExceptionalCondition (conditionName=conditionName@entry=0xa76a98 "!(!accessor->sts->participants[i].writing)", errorType=errorType@entry=0x8b2c49 "FailedAssertion", fileName=fileName@entry=0xa76991 "sharedtuplestore.c", lineNumber=lineNumber@entry=273) at assert.c:54
#3 0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at sharedtuplestore.c:273
#4 0x0000000000634de4 in ExecParallelHashRepartitionRest (hashtable=0xfaec18) at nodeHash.c:1369
#5 ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
#6 0x000000000063546b in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0xfaec18, size=40, shared=shared@entry=0x7ffee26a8868) at nodeHash.c:2778
#7 0x00000000006357c8 in ExecParallelHashTableInsert (hashtable=hashtable@entry=0xfaec18, slot=slot@entry=0xfa76f8, hashvalue=<optimized out>) at nodeHash.c:1696
#8 0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at nodeHash.c:288
#9 MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112

which seems to suggest that something in the state machine logic is
borked. ExecParallelHashIncreaseNumBatches() should've ensured that
everyone has called sts_end_write()...

Greetings,

Andres Freund

#3 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#2)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 21, 2017 at 10:29 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

Right, it looks like something takes more space on ppc systems causing
a batch increase that doesn't happen on amd64. I'll come back to
that.

but also one that's a lot more interesting:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08

which shows an assert failure:

#2 0x00000000008687d1 in ExceptionalCondition (conditionName=conditionName@entry=0xa76a98 "!(!accessor->sts->participants[i].writing)", errorType=errorType@entry=0x8b2c49 "FailedAssertion", fileName=fileName@entry=0xa76991 "sharedtuplestore.c", lineNumber=lineNumber@entry=273) at assert.c:54
#3 0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at sharedtuplestore.c:273
#4 0x0000000000634de4 in ExecParallelHashRepartitionRest (hashtable=0xfaec18) at nodeHash.c:1369
#5 ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
#6 0x000000000063546b in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0xfaec18, size=40, shared=shared@entry=0x7ffee26a8868) at nodeHash.c:2778
#7 0x00000000006357c8 in ExecParallelHashTableInsert (hashtable=hashtable@entry=0xfaec18, slot=slot@entry=0xfa76f8, hashvalue=<optimized out>) at nodeHash.c:1696
#8 0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at nodeHash.c:288
#9 MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112

which seems to suggest that something in the state machine logic is
borked. ExecParallelHashIncreaseNumBatches() should've ensured that
everyone has called sts_end_write()...

Hmm.  This looks the same as the one-off single assertion failure that
I mentioned[1] and had not been able to reproduce.  Investigating.

[1]: /messages/by-id/CAEepm=0oE=yO0Kam86W1d-iJoasWByYkcrkDoJu6t5mRhFGHkQ@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: pgsql: Add parallel-aware hash joins.

On 2017-12-21 01:29:40 -0800, Andres Freund wrote:

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

but also one that's a lot more interesting:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08

which shows an assert failure:

#2 0x00000000008687d1 in ExceptionalCondition (conditionName=conditionName@entry=0xa76a98 "!(!accessor->sts->participants[i].writing)", errorType=errorType@entry=0x8b2c49 "FailedAssertion", fileName=fileName@entry=0xa76991 "sharedtuplestore.c", lineNumber=lineNumber@entry=273) at assert.c:54
#3 0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at sharedtuplestore.c:273
#4 0x0000000000634de4 in ExecParallelHashRepartitionRest (hashtable=0xfaec18) at nodeHash.c:1369
#5 ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
#6 0x000000000063546b in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0xfaec18, size=40, shared=shared@entry=0x7ffee26a8868) at nodeHash.c:2778
#7 0x00000000006357c8 in ExecParallelHashTableInsert (hashtable=hashtable@entry=0xfaec18, slot=slot@entry=0xfa76f8, hashvalue=<optimized out>) at nodeHash.c:1696
#8 0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at nodeHash.c:288
#9 MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112

which seems to suggest that something in the state machine logic is
borked. ExecParallelHashIncreaseNumBatches() should've ensured that
everyone has called sts_end_write()...

Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
while other backends also access ->nbatch in
ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.

That'd lead to ExecParallelHashCloseBatchAccessors() likely not finishing
writing all batches (because nbatch < nbatch_old), which seems like it'd
explain this?

Greetings,

Andres Freund

#5 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: pgsql: Add parallel-aware hash joins.

On 2017-12-21 01:55:50 -0800, Andres Freund wrote:

On 2017-12-21 01:29:40 -0800, Andres Freund wrote:

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

but also one that's a lot more interesting:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08

which shows an assert failure:

#2 0x00000000008687d1 in ExceptionalCondition (conditionName=conditionName@entry=0xa76a98 "!(!accessor->sts->participants[i].writing)", errorType=errorType@entry=0x8b2c49 "FailedAssertion", fileName=fileName@entry=0xa76991 "sharedtuplestore.c", lineNumber=lineNumber@entry=273) at assert.c:54
#3 0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at sharedtuplestore.c:273
#4 0x0000000000634de4 in ExecParallelHashRepartitionRest (hashtable=0xfaec18) at nodeHash.c:1369
#5 ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
#6 0x000000000063546b in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0xfaec18, size=40, shared=shared@entry=0x7ffee26a8868) at nodeHash.c:2778
#7 0x00000000006357c8 in ExecParallelHashTableInsert (hashtable=hashtable@entry=0xfaec18, slot=slot@entry=0xfa76f8, hashvalue=<optimized out>) at nodeHash.c:1696
#8 0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at nodeHash.c:288
#9 MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112

which seems to suggest that something in the state machine logic is
borked. ExecParallelHashIncreaseNumBatches() should've ensured that
everyone has called sts_end_write()...

Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
while other backends also access ->nbatch in
ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.

That'd lead to ExecParallelHashCloseBatchAccessors() likely not finishing
writing all batches (because nbatch < nbatch_old), which seems like it'd
explain this?

Trying to debug this I found another issue. I'd placed a sleep(10) in
ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for
some reason. Segfault time:

#0 0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at /home/andres/src/postgresql/src/include/storage/s_lock.h:228
#1 0x000055bfbac42b4d in ConditionVariableCancelSleep () at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173
#2 0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478
#3 0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:4387
#4 0x000055bfba91ed97 in RemoveTempRelationsCallback (code=1, arg=0) at /home/andres/src/postgresql/src/backend/catalog/namespace.c:4034
#5 0x000055bfbac1bc90 in shmem_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:228
#6 0x000055bfbac1bb67 in proc_exit_prepare (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:185
#7 0x000055bfbac1bacf in proc_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:102
#8 0x000055bfbadbccf0 in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:543
#9 0x000055bfbac4eda3 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:2917
#10 0x000055bfbac42a63 in ConditionVariableSleep (cv=0x7fcd82ae14ac, wait_event_info=134217742) at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:129
#11 0x000055bfbac18405 in BarrierArriveAndWait (barrier=0x7fcd82ae1494, wait_event_info=134217742) at /home/andres/src/postgresql/src/backend/storage/ipc/barrier.c:191
#12 0x000055bfbaa9361e in ExecParallelHashIncreaseNumBatches (hashtable=0x55bfbd0e11d0) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:1191
#13 0x000055bfbaa962ef in ExecParallelHashTupleAlloc (hashtable=0x55bfbd0e11d0, size=40, shared=0x7ffda8967050) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:2781
#14 0x000055bfbaa946e8 in ExecParallelHashTableInsert (hashtable=0x55bfbd0e11d0, slot=0x55bfbd089a80, hashvalue=3825063138) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:1699
#15 0x000055bfbaa91d90 in MultiExecParallelHash (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:288
#16 0x000055bfbaa919b9 in MultiExecHash (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/nodeHash.c:112
#17 0x000055bfbaa7a500 in MultiExecProcNode (node=0x55bfbd089610) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:502
#18 0x000055bfbaa98515 in ExecHashJoinImpl (parallel=1 '\001', pstate=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:291
#19 ExecParallelHashJoin (pstate=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/nodeHashjoin.c:582
#20 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053d50) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#21 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053d50) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#22 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#23 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#24 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#25 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053698) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#26 0x000055bfbaa90a4e in ExecProcNode (node=0x55bfbd053698) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#27 0x000055bfbaa910d5 in gather_getnext (gatherstate=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:285
#28 0x000055bfbaa90f5f in ExecGather (pstate=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:216
#29 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd053340) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#30 0x000055bfbaa858c7 in ExecProcNode (node=0x55bfbd053340) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#31 0x000055bfbaa85d67 in fetch_input_tuple (aggstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:699
#32 0x000055bfbaa889b5 in agg_retrieve_direct (aggstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2355
#33 0x000055bfbaa8858e in ExecAgg (pstate=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:2166
#34 0x000055bfbaa7a424 in ExecProcNodeFirst (node=0x55bfbd052c18) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#35 0x000055bfbaa716c6 in ExecProcNode (node=0x55bfbd052c18) at /home/andres/src/postgresql/src/include/executor/executor.h:242
#36 0x000055bfbaa7404e in ExecutePlan (estate=0x55bfbd0529c8, planstate=0x55bfbd052c18, use_parallel_mode=1 '\001', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x55bfbd121ae0, execute_once=1 '\001')
at /home/andres/src/postgresql/src/backend/executor/execMain.c:1718
#37 0x000055bfbaa71cba in standard_ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:361
#38 0x000055bfbaa71ad4 in ExecutorRun (queryDesc=0x55bfbcf0fe68, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:304
#39 0x000055bfbac52725 in PortalRunSelect (portal=0x55bfbcf56a48, forward=1 '\001', count=0, dest=0x55bfbd121ae0) at /home/andres/src/postgresql/src/backend/tcop/pquery.c:932
#40 0x000055bfbac523b8 in PortalRun (portal=0x55bfbcf56a48, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x55bfbd121ae0, altdest=0x55bfbd121ae0, completionTag=0x7ffda8967840 "") at /home/andres/src/postgresql/src/backend/tcop/pquery.c:773
#41 0x000055bfbac4c0a1 in exec_simple_query (query_string=0x55bfbceefaf8 "select count(*) from simple r join bigger_than_it_looks s using (id);") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1120
#42 0x000055bfbac505e4 in PostgresMain (argc=1, argv=0x55bfbcf1d178, dbname=0x55bfbcf1cf30 "regression", username=0x55bfbceec588 "andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4139
#43 0x000055bfbabac375 in BackendRun (port=0x55bfbcf120f0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4412
#44 0x000055bfbababa74 in BackendStartup (port=0x55bfbcf120f0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4084
#45 0x000055bfbaba7d49 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1757
#46 0x000055bfbaba72d2 in PostmasterMain (argc=39, argv=0x55bfbcee9e90) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1365
#47 0x000055bfbaadb49d in main (argc=39, argv=0x55bfbcee9e90) at /home/andres/src/postgresql/src/backend/main/main.c:228

So, afaics no workers had yet attached, the leader accepted the cancel
interrupt, the dsm segments were destroyed, and as part of cleanup
cv_sleep_target was supposed to be reset, which fails because its
memory has since been freed. Looking at how that can happen.

Greetings,

Andres Freund

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
condition variable cleanup and subtransactions

Hi,

On 2017-12-21 02:42:25 -0800, Andres Freund wrote:

Trying to debug this I found another issue. I'd placed a sleep(10) in
ExecParallelHashCloseBatchAccessors() and then ctrl-c'ed the server for
some reason. Segfault time:

#0 0x000055bfbac42539 in tas (lock=0x7fcd82ae14ac <error: Cannot access memory at address 0x7fcd82ae14ac>) at /home/andres/src/postgresql/src/include/storage/s_lock.h:228
#1 0x000055bfbac42b4d in ConditionVariableCancelSleep () at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:173
#2 0x000055bfba8e24ae in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2478
#3 0x000055bfba8e4a2a in AbortOutOfAnyTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:4387

So, afaics no workers had yet attached, the leader accepted the cancel
interrupt, the dsm segments were destroyed, and as part of cleanup
cv_sleep_target was supposed to be reset, which fails because its
memory has since been freed. Looking at how that can happen.

Oh. This seems to be a condition variable bug independent of PHJ. The
problem is that the DSM segment etc all get cleaned up in
*subtransaction* abort.

Afaict it's a bug that AbortTransaction() does
ConditionVariableCancelSleep() but AbortSubTransaction() does not,
despite the latter releasing dsm segments via
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS).

Adding that seems to fix the crash.

This seems like something we need to backpatch.

Greetings,

Andres Freund

#7 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#6)
Re: condition variable cleanup and subtransactions

On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:

Oh. This seems to be a condition variable bug independent of PHJ. The
problem is that the DSM segment etc all get cleaned up in
*subtransaction* abort.

Afaict it's a bug that AbortTransaction() does
ConditionVariableCancelSleep() but AbortSubTransaction() does not,
despite the latter releasing dsm segments via
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS).

Adding that seems to fix the crash.

This seems like something we need to backpatch.

Agreed. That affects any user of condition variables inside DSM
segments, including the released Parallel Index Scan and Parallel
Bitmap Heap Scan code.

--
Thomas Munro
http://www.enterprisedb.com

#8 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#4)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 21, 2017 at 10:55 PM, Andres Freund <andres@anarazel.de> wrote:

Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
updates, via ExecParallelHashJoinSetUpBatches(), HashJoinTable->nbatch,
while other backends also access ->nbatch in
ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.

That'd lead to ExecParallelHashCloseBatchAccessors() likely not finishing
writing all batches (because nbatch < nbatch_old), which seems like it'd
explain this?

I don't think that's quite it, because it should never have set
'writing' for any batch number >= nbatch.

It's late here, but I'll take this up tomorrow and either find a fix
or figure out how to avoid antisocial noise levels on the build farm
in the meantime.

--
Thomas Munro
http://www.enterprisedb.com

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#7)
Re: condition variable cleanup and subtransactions

On Thu, Dec 21, 2017 at 6:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Dec 22, 2017 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:

Oh. This seems to be a condition variable bug independent of PHJ. The
problem is that the DSM segment etc all get cleaned up in
*subtransaction* abort.

Afaict it's a bug that AbortTransaction() does
ConditionVariableCancelSleep() but AbortSubTransaction() does not,
despite the latter releasing dsm segments via
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS).

Adding that seems to fix the crash.

This seems like something we need to backpatch.

Agreed. That affects any user of condition variables inside DSM
segments, including the released Parallel Index Scan and Parallel
Bitmap Heap Scan code.

Fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund <andres@anarazel.de> writes:

On 2017-12-21 08:49:46 +0000, Andres Freund wrote:

Add parallel-aware hash joins.

There are two relatively mundane failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08

[ as well as various assertion crashes ]

I see my machine prairiedog is showing yet another variation, possibly
related to the tern/termite failure, but not the same:

================= pgsql.build/src/test/regress/regression.diffs ===================
*** /Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/join.out	Thu Dec 21 03:55:15 2017
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/join.out	Thu Dec 21 10:56:00 2017
***************
*** 6682,6688 ****
  $$);
   multibatch 
  ------------
!  t
  (1 row)
  rollback to settings;
--- 6682,6688 ----
  $$);
   multibatch 
  ------------
!  f
  (1 row)

rollback to settings;

======================================================================

Let me know if you need me to poke into that.

regards, tom lane

#11 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#8)
Re: pgsql: Add parallel-aware hash joins.

On Fri, Dec 22, 2017 at 1:48 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I don't think that's quite it, because it should never have set
'writing' for any batch number >= nbatch.

It's late here, but I'll take this up tomorrow and either find a fix
or figure out how to avoid antisocial noise levels on the build farm
in the meantime.

Not there yet but I learned some things and am still working on it. I
spent a lot of time trying to reproduce the assertion failure, and
succeeded exactly once.  Unfortunately, the one time I managed to do
that I'd built with clang -O2 and got a core file that I couldn't get
much useful info out of, and I've been trying to do it again with -O0
ever since without luck. The time I succeeded, I reproduced it by
creating the tables "simple" and "bigger_than_it_looks" from join.sql
and then doing this in a loop:

set min_parallel_table_scan_size = 0;
set parallel_setup_cost = 0;
set work_mem = '192kB';

explain analyze select count(*) from simple r join
bigger_than_it_looks s using (id);

The machine that it happened on is resource constrained, and exhibits
another problem: though the above query normally runs in ~20ms,
sometimes it takes several seconds and occasionally much longer. That
never happens on fast development systems or test servers which run it
quickly every time, and it doesn't happen on my 2 core slow system if
I have only two workers (or one worker + leader). I dug into that and
figured out what was going wrong and wrote that up separately[1],
because I think it's an independent bug needing to be fixed, not the
root cause here. However, I think it could easily be contributing to
the timing required to trigger the bug we're looking for.

Andres, your machine francolin crashed -- got a core file?

[1]: /messages/by-id/CAEepm=0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

#12 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#11)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2017-12-22 21:16:10 +1300, Thomas Munro wrote:

Andres, your machine francolin crashed -- got a core file?

Unfortunately not - it appears the buildfarm cleared it away :(

Might try to reproduce it on that machine...

Greetings,

Andres Freund

#13 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#12)
1 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

On Fri, Dec 22, 2017 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-12-22 21:16:10 +1300, Thomas Munro wrote:

Andres, your machine francolin crashed -- got a core file?

Unfortunately not - it appears the buildfarm cleared it away :(

I now have a workload that fails within a few minutes or so on master.
I see the problem: MultiExecParallelHash() needs to run
sts_end_write() *before* detaching from grow_batches_barrier. My test
case doesn't fail with the attached patch applied.

I'll address the instability of the regression test output separately.
Sorry for the delay, due to the time of year.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

Attachment: fix-phj-writing-assertion-failure.patch (application/octet-stream)
From 24e096dc232f9f0aa1f1accd6a53ddafedbb19b6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Tue, 26 Dec 2017 22:44:00 +1300
Subject: [PATCH] Fix rare assertion failure in parallel hash join.

When a backend runs out of inner tuples to hash, it should detach from
grow_batch_barrier only after it has flushed all batches to disk and merged
counters, not before.  Otherwise a concurrent backend in
ExecParallelHashIncreaseNumBatches() could stop waiting for this backend and
try to read tuples before they have been written.  This commit reorders those
operations and should fix the assertion failures seen occasionally on the
build farm since commit 1804284042e659e7d16904e7bbb0ad546394b6a3.

Author: Thomas Munro
Discussion: https://postgr.es/m/E1eRwXy-0004IK-TO%40gemulon.postgresql.org
---
 src/backend/executor/nodeHash.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 0a519fae313..04eb3650aa3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -288,8 +288,6 @@ MultiExecParallelHash(HashState *node)
 					ExecParallelHashTableInsert(hashtable, slot, hashvalue);
 				hashtable->partialTuples++;
 			}
-			BarrierDetach(&pstate->grow_buckets_barrier);
-			BarrierDetach(&pstate->grow_batches_barrier);
 
 			/*
 			 * Make sure that any tuples we wrote to disk are visible to
@@ -304,6 +302,9 @@ MultiExecParallelHash(HashState *node)
 			 */
 			ExecParallelHashMergeCounters(hashtable);
 
+			BarrierDetach(&pstate->grow_buckets_barrier);
+			BarrierDetach(&pstate->grow_batches_barrier);
+
 			/*
 			 * Wait for everyone to finish building and flushing files and
 			 * counters.
-- 
2.15.0

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#13)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I'll address the instability of the regression test output separately.

If you're still looking for data on that --- prairiedog is able to
reproduce the "multibatch = f" variant about one time in thirty.
I modified the test case to print out the full EXPLAIN ANALYZE output
rather than a heavily filtered version. Here's a typical successful run:

! explain analyze
select length(max(s.t))
from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
! QUERY PLAN
! ------------------------------------------------------------------------------------------------------------------------------------------------
! Finalize Aggregate (cost=97.49..97.51 rows=1 width=4) (actual time=409.330..409.331 rows=1 loops=1)
! -> Gather (cost=97.28..97.49 rows=2 width=32) (actual time=376.957..395.841 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=97.28..97.29 rows=1 width=32) (actual time=254.370..254.373 rows=1 loops=3)
! -> Parallel Hash Left Join (cost=23.23..88.88 rows=3360 width=32) (actual time=240.812..241.297 rows=1 loops=3)
! Hash Cond: (wide.id = wide_1.id)
! -> Parallel Seq Scan on wide (cost=0.00..15.29 rows=529 width=4) (actual time=0.066..0.075 rows=1 loops=3)
! -> Parallel Hash (cost=16.61..16.61 rows=529 width=36) (actual time=109.565..109.565 rows=1 loops=3)
! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! -> Parallel Seq Scan on wide wide_1 (cost=0.00..16.61 rows=529 width=36) (actual time=2.989..7.218 rows=1 loops=3)
! Planning time: 1.317 ms
! Execution time: 424.395 ms
! (13 rows)

and here's a failing run:

! explain analyze
select length(max(s.t))
from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
! QUERY PLAN
! -------------------------------------------------------------------------------------------------------------------------------------------------
! Finalize Aggregate (cost=97.49..97.51 rows=1 width=4) (actual time=232.440..232.441 rows=1 loops=1)
! -> Gather (cost=97.28..97.49 rows=2 width=32) (actual time=225.738..226.744 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=97.28..97.29 rows=1 width=32) (actual time=29.377..29.379 rows=1 loops=3)
! -> Parallel Hash Left Join (cost=23.23..88.88 rows=3360 width=32) (actual time=22.747..25.340 rows=1 loops=3)
! Hash Cond: (wide.id = wide_1.id)
! -> Parallel Seq Scan on wide (cost=0.00..15.29 rows=529 width=4) (actual time=0.086..0.113 rows=2 loops=1)
! -> Parallel Hash (cost=16.61..16.61 rows=529 width=36) (actual time=16.382..16.382 rows=1 loops=3)
! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! -> Parallel Seq Scan on wide wide_1 (cost=0.00..16.61 rows=529 width=36) (actual time=9.167..21.301 rows=2 loops=1)
! Planning time: 1.289 ms
! Execution time: 243.120 ms
! (13 rows)

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker. I also find
it most curious that the "success" case is a lot slower than the "not
success" case. Perhaps this is related to your livelock issue? Doing
another run, I get something even slower:

! explain analyze
select length(max(s.t))
from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
! QUERY PLAN
! -------------------------------------------------------------------------------------------------------------------------------------------------
! Finalize Aggregate (cost=97.49..97.51 rows=1 width=4) (actual time=487.245..487.246 rows=1 loops=1)
! -> Gather (cost=97.28..97.49 rows=2 width=32) (actual time=444.650..475.390 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=97.28..97.29 rows=1 width=32) (actual time=345.816..345.819 rows=1 loops=3)
! -> Parallel Hash Left Join (cost=23.23..88.88 rows=3360 width=32) (actual time=334.229..338.098 rows=1 loops=3)
! Hash Cond: (wide.id = wide_1.id)
! -> Parallel Seq Scan on wide (cost=0.00..15.29 rows=529 width=4) (actual time=0.065..0.074 rows=1 loops=3)
! -> Parallel Hash (cost=16.61..16.61 rows=529 width=36) (actual time=140.210..140.210 rows=1 loops=3)
! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! -> Parallel Seq Scan on wide wide_1 (cost=0.00..16.61 rows=529 width=36) (actual time=4.233..15.117 rows=1 loops=3)
! Planning time: 1.380 ms
! Execution time: 509.607 ms
! (13 rows)

Aside from the instability problems, I'm pretty unhappy about how much
the PHJ patch has added to the runtime of "make check". I do not think
any one feature can justify adding 20% to that. Can't you cut down the
amount of data processed by these new test cases?

regards, tom lane

#15 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#14)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I'll address the instability of the regression test output separately.

If you're still looking for data on that --- prairiedog is able to
reproduce the "multibatch = f" variant about one time in thirty.
I modified the test case to print out the full EXPLAIN ANALYZE output
rather than a heavily filtered version. Here's a typical successful run:

! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! Execution time: 424.395 ms

and here's a failing run:

! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! Execution time: 243.120 ms

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker.

Hmm. Yeah, that seems quite likely -- thanks. Investigating now.

I also find
it most curious that the "success" case is a lot slower than the "not
success" case. Perhaps this is related to your livelock issue? Doing
another run, I get something even slower:

! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
! Execution time: 509.607 ms

Yes, that looks a lot like the ConditionVariableBroadcast() livelock
problem. That query with the same 8-batch 2-worker plan runs in ~12ms
and ~19ms for me on two different machines here without the run-to-run
variation you see.

Aside from the instability problems, I'm pretty unhappy about how much
the PHJ patch has added to the runtime of "make check". I do not think
any one feature can justify adding 20% to that. Can't you cut down the
amount of data processed by these new test cases?

Isn't that mostly because of the CV livelock problem? Here are the
real times I got with "time make check" on my slow 2 core Intel(R)
Celeron(R) CPU G1610T @ 2.30GHz running FreeBSD, which seems to be
prone to that problem on certain queries:

84940644 = 34.83s (before hash join regression tests)
fa330f9a = 35.81s (hash join regression tests)
18042840 = 44.92s (parallel-aware hash joins + new tests)

So the PHJ patch made it take 25% longer, similar to what you
reported. But then if I apply the following band-aid kludge to
condition_variable.c to limit the looping, I get:

-       while (ConditionVariableSignal(cv))
+       while (nwoken < max_parallel_workers && ConditionVariableSignal(cv))
                ++nwoken;

18042840 + kludge = 36.70s

So without the effects of that bug it's only taking 2.4% longer than
commit fa330f9a. Is that acceptable for a feature of this size and
complexity? I will also look into making the data sets smaller.

--
Thomas Munro
http://www.enterprisedb.com

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#15)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aside from the instability problems, I'm pretty unhappy about how much
the PHJ patch has added to the runtime of "make check". I do not think
any one feature can justify adding 20% to that. Can't you cut down the
amount of data processed by these new test cases?

Isn't that mostly because of the CV livelock problem?

Hm, maybe. I quoted the 20% figure on the basis of longfin's reports,
not prairiedog's ... but it might be seeing some of the livelock effect
too.

So without the effects of that bug it's only taking 2.4% longer than
commit fa330f9a. Is that acceptable for a feature of this size and
complexity? I will also look into making the data sets smaller.

That sounds better, but it's still worth asking whether the tests
could be quicker.

regards, tom lane

#17 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#15)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 28, 2017 at 5:15 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! Execution time: 243.120 ms

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker.

Hmm. Yeah, that seems quite likely -- thanks. Investigating now.

This is explained by the early exit case in
ExecParallelHashEnsureBatchAccessors(). With just the right timing,
it finishes up not reporting the true nbatch number, and never calling
ExecParallelHashUpdateSpacePeak().

In my patch for commit 5bcf389e (before PHJ), I had extracted and
rejiggered some parts of my PHJ work to fix a problem with EXPLAIN for
parallel-oblivious hash joins running in parallel queries, but I
failed to readapt it properly for PHJ. EXPLAIN needs to scan all
participants' HashInstrumentation to collect the greatest space
report, not just the first one it finds. I'll test and post a patch
to fix this tomorrow.

--
Thomas Munro
http://www.enterprisedb.com

#18 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#17)
1 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

On Fri, Dec 29, 2017 at 2:21 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 28, 2017 at 5:15 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
! Execution time: 243.120 ms

I don't have enough insight to be totally sure what this means, but the
"Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
part of the issue is failure to return stats from a worker.

Hmm. Yeah, that seems quite likely -- thanks. Investigating now.

This is explained by the early exit case in
ExecParallelHashEnsureBatchAccessors(). With just the right timing,
it finishes up not reporting the true nbatch number, and never calling
ExecParallelHashUpdateSpacePeak().

Hi Tom,

You mentioned that prairiedog sees the problem about one time in
thirty. Would you mind checking if it goes away with this patch
applied?

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-phj-explain.patch (application/octet-stream)
From cbed027275039cc5debf8db89342a133a831c9ca Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sun, 31 Dec 2017 02:03:07 +1300
Subject: [PATCH] Fix EXPLAIN ANALYZE output for Parallel Hash.

In a race case, EXPLAIN ANALYZE could fail to display correct nbatch and size
information.  Refactor so that participants report only on batches they worked
on rather than trying to report on all of them, and teach explain.c to
consider the HashInstrumentation object from all participants instead of
picking the first one it can find.  This should fix an occasional build farm
failure in the "join" regression test.

Author: Thomas Munro
Discussion: https://postgr.es/m/30219.1514428346%40sss.pgh.pa.us
---
 src/backend/commands/explain.c      | 75 ++++++++++++++++++++++++-------------
 src/backend/executor/nodeHash.c     | 28 ++++++--------
 src/backend/executor/nodeHashjoin.c |  6 ---
 src/include/executor/nodeHash.h     |  1 -
 4 files changed, 60 insertions(+), 50 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e4fbafc535..777c43b9e78 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2379,62 +2379,85 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 static void
 show_hash_info(HashState *hashstate, ExplainState *es)
 {
-	HashInstrumentation *hinstrument = NULL;
+	HashInstrumentation hinstrument = { 0 };
 
 	/*
 	 * In a parallel query, the leader process may or may not have run the
 	 * hash join, and even if it did it may not have built a hash table due to
 	 * timing (if it started late it might have seen no tuples in the outer
 	 * relation and skipped building the hash table).  Therefore we have to be
-	 * prepared to get instrumentation data from a worker if there is no hash
-	 * table.
+	 * prepared to get instrumentation data from all participants.
 	 */
 	if (hashstate->hashtable)
-	{
-		hinstrument = (HashInstrumentation *)
-			palloc(sizeof(HashInstrumentation));
-		ExecHashGetInstrumentation(hinstrument, hashstate->hashtable);
-	}
-	else if (hashstate->shared_info)
+		ExecHashGetInstrumentation(&hinstrument, hashstate->hashtable);
+
+	/*
+	 * Merge results from workers.  In the parallel-oblivious case, the
+	 * results from all participants should be identical, except where
+	 * participants didn't run the join at all so have no data.  In the
+	 * parallel-aware case, we need to aggregate the results.  Each worker may
+	 * have seen a different subset of batches and we want to report the peak
+	 * memory usage across all batches.
+	 */
+	if (hashstate->shared_info)
 	{
 		SharedHashInfo *shared_info = hashstate->shared_info;
 		int		i;
 
-		/* Find the first worker that built a hash table. */
 		for (i = 0; i < shared_info->num_workers; ++i)
 		{
-			if (shared_info->hinstrument[i].nbatch > 0)
+			HashInstrumentation *worker_hi = &shared_info->hinstrument[i];
+
+			if (worker_hi->nbatch > 0)
 			{
-				hinstrument = &shared_info->hinstrument[i];
-				break;
+				/*
+				 * Every participant should agree on the buckets, so to be
+				 * sure we have a value we'll just overwrite each time.
+				 */
+				hinstrument.nbuckets = worker_hi->nbuckets;
+				hinstrument.nbuckets_original = worker_hi->nbuckets_original;
+				/*
+				 * Normally every participant should agree on the number of
+				 * batches too, but it's possible for a backend that started
+				 * late and missed the whole join not to have the final nbatch
+				 * number.  So we'll take the largest number.
+				 */
+				hinstrument.nbatch = Max(hinstrument.nbatch, worker_hi->nbatch);
+				hinstrument.nbatch_original = worker_hi->nbatch_original;
+				/*
+				 * In a parallel-aware hash join, for now we report the
+				 * maximum peak memory reported by any worker.
+				 */
+				hinstrument.space_peak =
+					Max(hinstrument.space_peak, worker_hi->space_peak);
 			}
 		}
 	}
 
-	if (hinstrument)
+	if (hinstrument.nbatch > 0)
 	{
-		long		spacePeakKb = (hinstrument->space_peak + 1023) / 1024;
+		long		spacePeakKb = (hinstrument.space_peak + 1023) / 1024;
 
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
-			ExplainPropertyLong("Hash Buckets", hinstrument->nbuckets, es);
+			ExplainPropertyLong("Hash Buckets", hinstrument.nbuckets, es);
 			ExplainPropertyLong("Original Hash Buckets",
-								hinstrument->nbuckets_original, es);
-			ExplainPropertyLong("Hash Batches", hinstrument->nbatch, es);
+								hinstrument.nbuckets_original, es);
+			ExplainPropertyLong("Hash Batches", hinstrument.nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
-								hinstrument->nbatch_original, es);
+								hinstrument.nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hinstrument->nbatch_original != hinstrument->nbatch ||
-				 hinstrument->nbuckets_original != hinstrument->nbuckets)
+		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
+				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
 							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hinstrument->nbuckets,
-							 hinstrument->nbuckets_original,
-							 hinstrument->nbatch,
-							 hinstrument->nbatch_original,
+							 hinstrument.nbuckets,
+							 hinstrument.nbuckets_original,
+							 hinstrument.nbatch,
+							 hinstrument.nbatch_original,
 							 spacePeakKb);
 		}
 		else
@@ -2442,7 +2465,7 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
 							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument->nbuckets, hinstrument->nbatch,
+							 hinstrument.nbuckets, hinstrument.nbatch,
 							 spacePeakKb);
 		}
 	}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 04eb3650aa3..7d5fcd0e559 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -3090,7 +3090,17 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 				batch->buckets = InvalidDsaPointer;
 			}
 		}
-		ExecParallelHashUpdateSpacePeak(hashtable, curbatch);
+
+		/*
+		 * Track the largest batch we've been attached to.  Though each
+		 * backend might see a different subset of batches, explain.c knows
+		 * how to combine the results from different backends to find the
+		 * global peak.
+		 */
+		hashtable->spacePeak =
+			Max(hashtable->spacePeak,
+				batch->size + sizeof(dsa_pointer_atomic) * hashtable->nbuckets);
+
 		/* Remember that we are not attached to a batch. */
 		hashtable->curbatch = -1;
 	}
@@ -3295,19 +3305,3 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 
 	return true;
 }
-
-/*
- * Update this backend's copy of hashtable->spacePeak to account for a given
- * batch.  This is called at the end of hashing for batch 0, and then for each
- * batch when it is done or discovered to be already done.  The result is used
- * for EXPLAIN output.
- */
-void
-ExecParallelHashUpdateSpacePeak(HashJoinTable hashtable, int batchno)
-{
-	size_t		size;
-
-	size = hashtable->batches[batchno].shared->size;
-	size += sizeof(dsa_pointer_atomic) * hashtable->nbuckets;
-	hashtable->spacePeak = Max(hashtable->spacePeak, size);
-}
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5d1dc1f401e..817bcf04713 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1186,12 +1186,6 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * remain).
 					 */
 					BarrierDetach(batch_barrier);
-
-					/*
-					 * We didn't work on this batch, but we need to observe
-					 * its size for EXPLAIN.
-					 */
-					ExecParallelHashUpdateSpacePeak(hashtable, batchno);
 					hashtable->batches[batchno].done = true;
 					hashtable->curbatch = -1;
 					break;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 84c166b3951..367dfff018c 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -33,7 +33,6 @@ extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 									 int batchno);
-void		ExecParallelHashUpdateSpacePeak(HashJoinTable hashtable, int batchno);
 
 extern void ExecHashTableInsert(HashJoinTable hashtable,
 					TupleTableSlot *slot,
-- 
2.15.0

#19 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#18)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2017-12-31 02:51:26 +1300, Thomas Munro wrote:

You mentioned that prairiedog sees the problem about one time in
thirty. Would you mind checking if it goes away with this patch
applied?

--
Thomas Munro
http://www.enterprisedb.com

From cbed027275039cc5debf8db89342a133a831c9ca Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sun, 31 Dec 2017 02:03:07 +1300
Subject: [PATCH] Fix EXPLAIN ANALYZE output for Parallel Hash.

In a race case, EXPLAIN ANALYZE could fail to display correct nbatch and size
information. Refactor so that participants report only on batches they worked
on rather than trying to report on all of them, and teach explain.c to
consider the HashInstrumentation object from all participants instead of
picking the first one it can find. This should fix an occasional build farm
failure in the "join" regression test.

This seems buggy independent of whether it fixes the issue on prairiedog,
right? So I'm inclined to go ahead and just fix it...

+	/*
+	 * Merge results from workers.  In the parallel-oblivious case, the
+	 * results from all participants should be identical, except where
+	 * participants didn't run the join at all so have no data.  In the
+	 * parallel-aware case, we need to aggregate the results.  Each worker may
+	 * have seen a different subset of batches and we want to report the peak
+	 * memory usage across all batches.
+	 */

It's not necessarily the peak though, right? The largest batches might
not be read in at the same time. I'm fine with approximating it as such,
just want to make sure I understand.

+ if (hashstate->shared_info)
{
SharedHashInfo *shared_info = hashstate->shared_info;
int i;

-		/* Find the first worker that built a hash table. */
for (i = 0; i < shared_info->num_workers; ++i)
{
-			if (shared_info->hinstrument[i].nbatch > 0)
+			HashInstrumentation *worker_hi = &shared_info->hinstrument[i];
+
+			if (worker_hi->nbatch > 0)
{
-				hinstrument = &shared_info->hinstrument[i];
-				break;
+				/*
+				 * Every participant should agree on the buckets, so to be
+				 * sure we have a value we'll just overwrite each time.
+				 */
+				hinstrument.nbuckets = worker_hi->nbuckets;
+				hinstrument.nbuckets_original = worker_hi->nbuckets_original;
+				/*
+				 * Normally every participant should agree on the number of
+				 * batches too, but it's possible for a backend that started
+				 * late and missed the whole join not to have the final nbatch
+				 * number.  So we'll take the largest number.
+				 */
+				hinstrument.nbatch = Max(hinstrument.nbatch, worker_hi->nbatch);
+				hinstrument.nbatch_original = worker_hi->nbatch_original;
+				/*
+				 * In a parallel-aware hash join, for now we report the
+				 * maximum peak memory reported by any worker.
+				 */
+				hinstrument.space_peak =
+					Max(hinstrument.space_peak, worker_hi->space_peak);

I bet pgindent will not like this layout.

Ho hum. Is this really, as you say above, an "aggregate the results"?

Greetings,

Andres Freund

#20 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#19)
1 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

On Sun, Dec 31, 2017 at 5:16 AM, Andres Freund <andres@anarazel.de> wrote:

In a race case, EXPLAIN ANALYZE could fail to display correct nbatch and size
information. Refactor so that participants report only on batches they worked
on rather than trying to report on all of them, and teach explain.c to
consider the HashInstrumentation object from all participants instead of
picking the first one it can find. This should fix an occasional build farm
failure in the "join" regression test.

This seems buggy independent of whether it fixes the issue on prairiedog,
right? So I'm inclined to go ahead and just fix it...

+1

+     /*
+      * Merge results from workers.  In the parallel-oblivious case, the
+      * results from all participants should be identical, except where
+      * participants didn't run the join at all so have no data.  In the
+      * parallel-aware case, we need to aggregate the results.  Each worker may
+      * have seen a different subset of batches and we want to report the peak
+      * memory usage across all batches.
+      */

It's not necessarily the peak though, right? The largest batches might
not be read in at the same time. I'm fine with approximating it as such,
just want to make sure I understand.

Yeah, it's not attempting to report the true simultaneous peak memory
usage. It's only reporting the largest individual hash table ever
loaded. In a multi-batch join more than one hash table may be loaded
at the same time -- up to the number of participants -- but I'm not
yet attempting to reflect that. On the one hand, that's a bit like
the way we show the size for parallel-oblivious hash joins: each
participant used the reported amount of memory at approximately the
same time. On the other hand, the total simultaneous memory usage for
parallel-aware hash join is capped by both nbatch and nparticipants:
the true simultaneous peak must be <= largest_hash_table * Min(nbatch,
nparticipants). I considered various ways to capture and report this.
The 0007 patch in the v26 patchset showed per-worker information
separately, but I abandoned it as useless and confusing. Another idea
would be to report the sum of the nparticipants largest hash tables, or
to assume all batches are similarly sized and use the formula I gave
above; yet another would be to track which hash tables or memory
regions were simultaneously loaded, with an incremental shared counter
maintained as hash chunks and bucket arrays are allocated and freed.
But I figured we should go with something super simple for now and
discuss better ideas as a later evolution.

[code]

I bet pgindent will not like this layout.

pgindented.

Ho hum. Is this really, as you say above, an "aggregate the results"?

Yeah, misleading/stupid use of "aggregate" (SQL MAX() is an
aggregate...). Offending word removed.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-phj-explain-v2.patch (application/octet-stream)
From 4fa8ad852064403a6553cc71c8de21d59d972f01 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Sun, 31 Dec 2017 02:03:07 +1300
Subject: [PATCH] Fix EXPLAIN ANALYZE output for Parallel Hash.

In a race case, EXPLAIN ANALYZE could fail to display correct nbatch and size
information.  Refactor so that participants report only on batches they worked
on rather than trying to report on all of them, and teach explain.c to
consider the HashInstrumentation object from all participants instead of
picking the first one it can find.  This should fix an occasional build farm
failure in the "join" regression test.

Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/30219.1514428346%40sss.pgh.pa.us
---
 src/backend/commands/explain.c      | 79 ++++++++++++++++++++++++-------------
 src/backend/executor/nodeHash.c     | 27 +++++--------
 src/backend/executor/nodeHashjoin.c |  6 ---
 src/include/executor/nodeHash.h     |  1 -
 4 files changed, 62 insertions(+), 51 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e4fbafc535..2156385ac88 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2379,62 +2379,87 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 static void
 show_hash_info(HashState *hashstate, ExplainState *es)
 {
-	HashInstrumentation *hinstrument = NULL;
+	HashInstrumentation hinstrument = {0};
 
 	/*
 	 * In a parallel query, the leader process may or may not have run the
 	 * hash join, and even if it did it may not have built a hash table due to
 	 * timing (if it started late it might have seen no tuples in the outer
 	 * relation and skipped building the hash table).  Therefore we have to be
-	 * prepared to get instrumentation data from a worker if there is no hash
-	 * table.
+	 * prepared to get instrumentation data from all participants.
 	 */
 	if (hashstate->hashtable)
-	{
-		hinstrument = (HashInstrumentation *)
-			palloc(sizeof(HashInstrumentation));
-		ExecHashGetInstrumentation(hinstrument, hashstate->hashtable);
-	}
-	else if (hashstate->shared_info)
+		ExecHashGetInstrumentation(&hinstrument, hashstate->hashtable);
+
+	/*
+	 * Merge results from workers.  In the parallel-oblivious case, the
+	 * results from all participants should be identical, except where
+	 * participants didn't run the join at all so have no data.  In the
+	 * parallel-aware case, we need to consider all the results.  Each worker
+	 * may have seen a different subset of batches and we want to find the
+	 * highest memory usage for any one batch across all batches.
+	 */
+	if (hashstate->shared_info)
 	{
 		SharedHashInfo *shared_info = hashstate->shared_info;
-		int		i;
+		int			i;
 
-		/* Find the first worker that built a hash table. */
 		for (i = 0; i < shared_info->num_workers; ++i)
 		{
-			if (shared_info->hinstrument[i].nbatch > 0)
+			HashInstrumentation *worker_hi = &shared_info->hinstrument[i];
+
+			if (worker_hi->nbatch > 0)
 			{
-				hinstrument = &shared_info->hinstrument[i];
-				break;
+				/*
+				 * Every participant should agree on the buckets, so to be
+				 * sure we have a value we'll just overwrite each time.
+				 */
+				hinstrument.nbuckets = worker_hi->nbuckets;
+				hinstrument.nbuckets_original = worker_hi->nbuckets_original;
+
+				/*
+				 * Normally every participant should agree on the number of
+				 * batches too, but it's possible for a backend that started
+				 * late and missed the whole join not to have the final nbatch
+				 * number.  So we'll take the largest number.
+				 */
+				hinstrument.nbatch = Max(hinstrument.nbatch, worker_hi->nbatch);
+				hinstrument.nbatch_original = worker_hi->nbatch_original;
+
+				/*
+				 * In a parallel-aware hash join, for now we report the
+				 * maximum peak memory reported by any worker.
+				 */
+				hinstrument.space_peak =
+					Max(hinstrument.space_peak, worker_hi->space_peak);
 			}
 		}
 	}
 
-	if (hinstrument)
+	if (hinstrument.nbatch > 0)
 	{
-		long		spacePeakKb = (hinstrument->space_peak + 1023) / 1024;
+		long		spacePeakKb = (hinstrument.space_peak + 1023) / 1024;
 
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
-			ExplainPropertyLong("Hash Buckets", hinstrument->nbuckets, es);
+			ExplainPropertyLong("Hash Buckets", hinstrument.nbuckets, es);
 			ExplainPropertyLong("Original Hash Buckets",
-								hinstrument->nbuckets_original, es);
-			ExplainPropertyLong("Hash Batches", hinstrument->nbatch, es);
+								hinstrument.nbuckets_original, es);
+			ExplainPropertyLong("Hash Batches", hinstrument.nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
-								hinstrument->nbatch_original, es);
+								hinstrument.nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hinstrument->nbatch_original != hinstrument->nbatch ||
-				 hinstrument->nbuckets_original != hinstrument->nbuckets)
+		else if (hinstrument.nbatch_original != hinstrument.nbatch ||
+				 hinstrument.nbuckets_original != hinstrument.nbuckets)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
 							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hinstrument->nbuckets,
-							 hinstrument->nbuckets_original,
-							 hinstrument->nbatch,
-							 hinstrument->nbatch_original,
+							 hinstrument.nbuckets,
+							 hinstrument.nbuckets_original,
+							 hinstrument.nbatch,
+							 hinstrument.nbatch_original,
 							 spacePeakKb);
 		}
 		else
@@ -2442,7 +2467,7 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
 							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
-							 hinstrument->nbuckets, hinstrument->nbatch,
+							 hinstrument.nbuckets, hinstrument.nbatch,
 							 spacePeakKb);
 		}
 	}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 04eb3650aa3..4e1a2806b55 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -3090,7 +3090,16 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 				batch->buckets = InvalidDsaPointer;
 			}
 		}
-		ExecParallelHashUpdateSpacePeak(hashtable, curbatch);
+
+		/*
+		 * Track the largest batch we've been attached to.  Though each
+		 * backend might see a different subset of batches, explain.c will
+		 * scan the results from all backends to find the largest value.
+		 */
+		hashtable->spacePeak =
+			Max(hashtable->spacePeak,
+				batch->size + sizeof(dsa_pointer_atomic) * hashtable->nbuckets);
+
 		/* Remember that we are not attached to a batch. */
 		hashtable->curbatch = -1;
 	}
@@ -3295,19 +3304,3 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 
 	return true;
 }
-
-/*
- * Update this backend's copy of hashtable->spacePeak to account for a given
- * batch.  This is called at the end of hashing for batch 0, and then for each
- * batch when it is done or discovered to be already done.  The result is used
- * for EXPLAIN output.
- */
-void
-ExecParallelHashUpdateSpacePeak(HashJoinTable hashtable, int batchno)
-{
-	size_t		size;
-
-	size = hashtable->batches[batchno].shared->size;
-	size += sizeof(dsa_pointer_atomic) * hashtable->nbuckets;
-	hashtable->spacePeak = Max(hashtable->spacePeak, size);
-}
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5d1dc1f401e..817bcf04713 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1186,12 +1186,6 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * remain).
 					 */
 					BarrierDetach(batch_barrier);
-
-					/*
-					 * We didn't work on this batch, but we need to observe
-					 * its size for EXPLAIN.
-					 */
-					ExecParallelHashUpdateSpacePeak(hashtable, batchno);
 					hashtable->batches[batchno].done = true;
 					hashtable->curbatch = -1;
 					break;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 84c166b3951..367dfff018c 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -33,7 +33,6 @@ extern void ExecHashTableDetach(HashJoinTable hashtable);
 extern void ExecHashTableDetachBatch(HashJoinTable hashtable);
 extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
 									 int batchno);
-void		ExecParallelHashUpdateSpacePeak(HashJoinTable hashtable, int batchno);
 
 extern void ExecHashTableInsert(HashJoinTable hashtable,
 					TupleTableSlot *slot,
-- 
2.15.0

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#18)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

This is explained by the early exit case in
ExecParallelHashEnsureBatchAccessors(). With just the right timing,
it finishes up not reporting the true nbatch number, and never calling
ExecParallelHashUpdateSpacePeak().

Hi Tom,

You mentioned that prairiedog sees the problem about one time in
thirty. Would you mind checking if it goes away with this patch
applied?

I've run 55 cycles of "make installcheck" without seeing a failure
with this patch installed. That's not enough to be totally sure
of course, but I think this probably fixes it.

However ... I noticed that my other dinosaur gaur shows the other failure
mode we see in the buildfarm, the "increased_batches = t" diff, and
I can report that this patch does *not* help that. The underlying
EXPLAIN output goes from something like

! Finalize Aggregate (cost=823.85..823.86 rows=1 width=8) (actual time=1378.102..1378.105 rows=1 loops=1)
! -> Gather (cost=823.63..823.84 rows=2 width=8) (actual time=1377.909..1378.006 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=823.63..823.64 rows=1 width=8) (actual time=1280.298..1280.302 rows=1 loops=3)
! -> Parallel Hash Join (cost=387.50..802.80 rows=8333 width=0) (actual time=1070.179..1249.142 rows=6667 loops=3)
! Hash Cond: (r.id = s.id)
! -> Parallel Seq Scan on simple r (cost=0.00..250.33 rows=8333 width=4) (actual time=0.173..62.063 rows=6667 loops=3)
! -> Parallel Hash (cost=250.33..250.33 rows=8333 width=4) (actual time=454.305..454.305 rows=6667 loops=3)
! Buckets: 4096 Batches: 8 Memory Usage: 208kB
! -> Parallel Seq Scan on simple s (cost=0.00..250.33 rows=8333 width=4) (actual time=0.178..67.115 rows=6667 loops=3)
! Planning time: 1.861 ms
! Execution time: 1687.311 ms

to something like

! Finalize Aggregate (cost=823.85..823.86 rows=1 width=8) (actual time=1588.733..1588.737 rows=1 loops=1)
! -> Gather (cost=823.63..823.84 rows=2 width=8) (actual time=1588.529..1588.634 rows=3 loops=1)
! Workers Planned: 2
! Workers Launched: 2
! -> Partial Aggregate (cost=823.63..823.64 rows=1 width=8) (actual time=1492.631..1492.635 rows=1 loops=3)
! -> Parallel Hash Join (cost=387.50..802.80 rows=8333 width=0) (actual time=1270.309..1451.501 rows=6667 loops=3)
! Hash Cond: (r.id = s.id)
! -> Parallel Seq Scan on simple r (cost=0.00..250.33 rows=8333 width=4) (actual time=0.219..158.144 rows=6667 loops=3)
! -> Parallel Hash (cost=250.33..250.33 rows=8333 width=4) (actual time=634.614..634.614 rows=6667 loops=3)
! Buckets: 4096 (originally 4096) Batches: 16 (originally 8) Memory Usage: 176kB
! -> Parallel Seq Scan on simple s (cost=0.00..250.33 rows=8333 width=4) (actual time=0.182..120.074 rows=6667 loops=3)
! Planning time: 1.931 ms
! Execution time: 2219.417 ms

so again we have a case where the plan didn't change but the execution
behavior did. This isn't quite 100% reproducible on gaur/pademelon,
but it seems to fail more often than not, so I can poke into it
if you can say what info would be helpful.

regards, tom lane

#22Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#21)
Re: pgsql: Add parallel-aware hash joins.

On Sun, Dec 31, 2017 at 11:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

You mentioned that prairiedog sees the problem about one time in
thirty. Would you mind checking if it goes away with this patch
applied?

I've run 55 cycles of "make installcheck" without seeing a failure
with this patch installed. That's not enough to be totally sure
of course, but I think this probably fixes it.

Thanks!

However ... I noticed that my other dinosaur gaur shows the other failure
mode we see in the buildfarm, the "increased_batches = t" diff, and
I can report that this patch does *not* help that. The underlying
EXPLAIN output goes from something like

! Buckets: 4096 Batches: 8 Memory Usage: 208kB

to something like

! Buckets: 4096 (originally 4096) Batches: 16 (originally 8) Memory Usage: 176kB

so again we have a case where the plan didn't change but the execution
behavior did. This isn't quite 100% reproducible on gaur/pademelon,
but it seems to fail more often than not, so I can poke into it
if you can say what info would be helpful.

Right. That's apparently unrelated and is the last build-farm issue
on my list (so far). I had noticed that certain BF animals are prone
to that particular failure, and they mostly have architectures that I
don't have so a few things are probably just differently sized. At
first I thought I'd tweak the tests so that the parameters were always
stable, and I got as far as installing Debian on qemu-system-ppc (it
took a looong time to compile PostgreSQL), but that seems a bit cheap
and flimsy... better to fix the size estimation error.

I assume that what happens here is the planner's size estimation code
sometimes disagrees with Parallel Hash's chunk-based memory
accounting, even though in this case we had perfect tuple count and
tuple size information. In an earlier version of the patch set I
refactored the planner to be chunk-aware (even for parallel-oblivious
hash join), but later in the process I tried to simplify and shrink
the patch set and avoid making unnecessary changes to non-Parallel
Hash code paths. I think I'll need to make the planner aware of the
maximum amount of fragmentation possible when parallel-aware
(something like: up to one tuple's worth at the end of each chunk, and
up to one whole wasted chunk per participating backend). More soon.

--
Thomas Munro
http://www.enterprisedb.com

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#22)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Sun, Dec 31, 2017 at 11:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

... This isn't quite 100% reproducible on gaur/pademelon,
but it seems to fail more often than not, so I can poke into it
if you can say what info would be helpful.

Right. That's apparently unrelated and is the last build-farm issue
on my list (so far). I had noticed that certain BF animals are prone
to that particular failure, and they mostly have architectures that I
don't have so a few things are probably just differently sized. At
first I thought I'd tweak the tests so that the parameters were always
stable, and I got as far as installing Debian on qemu-system-ppc (it
took a looong time to compile PostgreSQL), but that seems a bit cheap
and flimsy... better to fix the size estimation error.

"Size estimation error"? Why do you think it's that? We have exactly
the same plan in both cases.

My guess is that what's happening is that one worker or the other ends
up processing the whole scan, or the vast majority of it, so that that
worker's hash table has to hold substantially more than half of the
tuples and thereby is forced to up the number of batches. I don't see
how you can expect to estimate that situation exactly; or if you do,
you'll be pessimizing the plan for cases where the split is more nearly
equal.

By this theory, the reason why certain BF members are more prone to the
failure is that they're single-processor machines, and perhaps have
kernels with relatively long scheduling quanta, so that it's more likely
that the worker that gets scheduled first is able to read the whole input
to the hash step.

I assume that what happens here is the planner's size estimation code
sometimes disagrees with Parallel Hash's chunk-based memory
accounting, even though in this case we had perfect tuple count and
tuple size information. In an earlier version of the patch set I
refactored the planner to be chunk-aware (even for parallel-oblivious
hash join), but later in the process I tried to simplify and shrink
the patch set and avoid making unnecessary changes to non-Parallel
Hash code paths. I think I'll need to make the planner aware of the
maximum amount of fragmentation possible when parallel-aware
(something like: up to one tuple's worth at the end of each chunk, and
up to one whole wasted chunk per participating backend). More soon.

I'm really dubious that trying to model the executor's space consumption
exactly is a good idea, even if it did fix this specific problem.
That would expend extra planner cycles and pose a continuing maintenance
gotcha.

regards, tom lane

#24Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#20)
Re: pgsql: Add parallel-aware hash joins.

On 2017-12-31 10:59:26 +1300, Thomas Munro wrote:

On Sun, Dec 31, 2017 at 5:16 AM, Andres Freund <andres@anarazel.de> wrote:

In a race case, EXPLAIN ANALYZE could fail to display correct nbatch and size
information. Refactor so that participants report only on batches they worked
on rather than trying to report on all of them, and teach explain.c to
consider the HashInstrumentation object from all participants instead of
picking the first one it can find. This should fix an occasional build farm
failure in the "join" regression test.

This seems buggy independent of whether it fixes the issue on prairedog,
right? So I'm inclined to go ahead and just fix it...

+1

Pushed your updated version.

Thanks,

Andres

#25Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#24)
Re: pgsql: Add parallel-aware hash joins.

On Tue, Jan 2, 2018 at 11:42 AM, Andres Freund <andres@anarazel.de> wrote:

Pushed your updated version.

Thanks. That should leave just the test failures like this on certain machines:

*** 6103,6109 ****
$$);
initially_multibatch | increased_batches
----------------------+-------------------
! t | f
(1 row)

  rollback to settings;
--- 6103,6109 ----
  $$);
   initially_multibatch | increased_batches
  ----------------------+-------------------
!  t                    | t
  (1 row)

I'll post a patch that should fix that along with a reply to Tom's
most recent email tomorrow, when I'm back from my break.

--
Thomas Munro
http://www.enterprisedb.com

#26Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#23)
Re: pgsql: Add parallel-aware hash joins.

On Sun, Dec 31, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Right. That's apparently unrelated and is the last build-farm issue
on my list (so far). I had noticed that certain BF animals are prone
to that particular failure, and they mostly have architectures that I
don't have so a few things are probably just differently sized. At
first I thought I'd tweak the tests so that the parameters were always
stable, and I got as far as installing Debian on qemu-system-ppc (it
took a looong time to compile PostgreSQL), but that seems a bit cheap
and flimsy... better to fix the size estimation error.

"Size estimation error"? Why do you think it's that? We have exactly
the same plan in both cases.

I mean that ExecChooseHashTableSize() estimates the hash table size like this:

inner_rel_bytes = ntuples * tupsize;

... but then at execution time, in the Parallel Hash case, we do
memory accounting not in tuples but in chunks. The various
participants pack tuples into 32KB chunks, and they trigger an
increase in the number of batches when the total size of all chunks
happens to exceed the memory budget. In this case they do so
unexpectedly, due to that extra overhead at execution time that the
planner didn't account for. Since we happened to be close to the
threshold, in this case between choosing 8 and 16 batches, we can get
it wrong and have to increase nbatch at execution time.

Non-parallel Hash also has such fragmentation. There are headers +
extra space at the end of each chunk, especially the end of the final
chunk. But it doesn't matter there because the executor doesn't count
the overhead either. For Parallel Hash I count the overhead, because
it reduces IPC if I do all the accounting in 32KB chunks.

I'm torn between (1) posting a patch that teaches
ExecChooseHashTableSize() to estimate the worst case extra
fragmentation assuming all participants contribute an almost entirely
empty chunk at the end, and (2) just finding some parameters (ie tweak
work_mem or number of tuples) that will make this work on all
computers in the build farm. I think the former is the correct
solution. Another solution would be to teach the executor to discount
the overhead, but that seems hard and seems like it's travelling in
the wrong direction.

My guess is that what's happening is that one worker or the other ends
up processing the whole scan, or the vast majority of it, so that that
worker's hash table has to hold substantially more than half of the
tuples and thereby is forced to up the number of batches. I don't see
how you can expect to estimate that situation exactly; or if you do,
you'll be pessimizing the plan for cases where the split is more nearly
equal.

That sort of thing does indeed affect the size at execution. You can
see that run to run variation easily with a small join forced to use
Parallel Hash, so that there is a race to load tuples. You get a
larger size if more workers manage to load at least one tuple, due to
their final partially filled chunk.

There is also the question of this being underestimated on systems
without real atomics:

bucket_bytes = sizeof(HashJoinTuple) * nbuckets;

The real size at execution time is sizeof(dsa_pointer_atomic) *
nbuckets. I don't think that's responsible for this particular
underestimation problem because the bucket array is currently not
considered at execution time when deciding to increase batches -- it
should be, and I'll come back to those two problems separately.

--
Thomas Munro
http://www.enterprisedb.com

#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#26)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Sun, Dec 31, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Size estimation error"? Why do you think it's that? We have exactly
the same plan in both cases.

I mean that ExecChooseHashTableSize() estimates the hash table size like this:
inner_rel_bytes = ntuples * tupsize;

... but then at execution time, in the Parallel Hash case, we do
memory accounting not in tuples but in chunks. The various
participants pack tuples into 32KB chunks, and they trigger an
increase in the number of batches when the total size of all chunks
happens to exceeds the memory budget. In this case they do so
unexpectedly due to that extra overhead at execution time that the
planner didn't account for. We happened to be close to the threshold,
in this case between choosing 8 batches and 16 batches, we can get it
wrong and have to increase nbatch at execution time.

If that's the issue, why doesn't the test fail every time on affected
platforms? There shouldn't be anything nondeterministic about the
number or size of tuples going into the hash table?

... You get a
larger size if more workers manage to load at least one tuple, due to
their final partially filled chunk.

Hm. That could do it, except it doesn't really account for the observed
result that slower single-processor machines seem more prone to the
bug. Surely they should be less likely to get multiple workers activated.

BTW, I'm seeing a few things that look bug-like about
ExecParallelHashTuplePrealloc, for instance why does it use just
"size" to decide if space_allowed is exceeded but if not then add the
typically-much-larger value "want + HASH_CHUNK_HEADER_SIZE" to
estimated_size. That clearly can allow estimated_size to get
significantly past space_allowed --- if it's not a bug, it at least
deserves a comment explaining why not. Another angle, which does not
apply to this test case but seems like a bug for real usage, is that
ExecParallelHashTuplePrealloc doesn't account correctly for tuples wider
than HASH_CHUNK_THRESHOLD.

I'm also wondering why the non-parallel path seems to prefer to allocate
in units of HASH_CHUNK_SIZE + HASH_CHUNK_HEADER_SIZE while the parallel
path targets allocations of exactly HASH_CHUNK_SIZE, and why there's such
inconsistency in whether tuples of exactly HASH_CHUNK_THRESHOLD bytes
are treated as "big" or not.

regards, tom lane

#28Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#27)
1 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

On Wed, Jan 3, 2018 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I mean that ExecChooseHashTableSize() estimates the hash table size like this:
inner_rel_bytes = ntuples * tupsize;

... but then at execution time, in the Parallel Hash case, we do
memory accounting not in tuples but in chunks. The various
participants pack tuples into 32KB chunks, and they trigger an
increase in the number of batches when the total size of all chunks
happens to exceed the memory budget. In this case they do so
unexpectedly, due to that extra overhead at execution time that the
planner didn't account for. Since we happened to be close to the
threshold, in this case between choosing 8 and 16 batches, we can get
it wrong and have to increase nbatch at execution time.

If that's the issue, why doesn't the test fail every time on affected
platforms? There shouldn't be anything nondeterministic about the
number or size of tuples going into the hash table?

... You get a
larger size if more workers manage to load at least one tuple, due to
their final partially filled chunk.

Hm. That could do it, except it doesn't really account for the observed
result that slower single-processor machines seem more prone to the
bug. Surely they should be less likely to get multiple workers activated.

I can reproduce the instability here with an -m32 build and this:

create table simple as
select generate_series(1, 20000) AS id,
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
alter table simple set (parallel_workers = 2);
analyze simple;

set parallel_setup_cost = 0;
set work_mem = '96kB';

explain analyze select count(*) from simple r join simple s using
(id);

It plans for 8 batches, and then usually but not always goes to 16 at
execution time depending on timing. It doesn't happen for me with
128kB (the setting used in the regression test), but I think the
affected BF machines are all 32 bit systems that have MAXIMUM_ALIGNOF
8 and therefore use a bit more space, whereas my machines have
MAXIMUM_ALIGNOF 4 in a 32 bit build, so that would explain the
different location of this unstable border. We could probably fix the
failures by simply increasing work_mem out of that zone, but I'm
hoping to fix the problem in a more principled way. More soon.

BTW, I'm seeing a few things that look bug-like about
ExecParallelHashTuplePrealloc, for instance why does it use just
"size" to decide if space_allowed is exceeded but if not then add the
typically-much-larger value "want + HASH_CHUNK_HEADER_SIZE" to
estimated_size. That clearly can allow estimated_size to get
significantly past space_allowed --- if it's not a bug, it at least
deserves a comment explaining why not.

Right. Fixed in the attached.

Another angle, which does not
apply to this test case but seems like a bug for real usage, is that
ExecParallelHashTuplePrealloc doesn't account correctly for tuples wider
than HASH_CHUNK_THRESHOLD.

Right. I'll address that separately.

I'm also wondering why the non-parallel path seems to prefer to allocate
in units of HASH_CHUNK_SIZE + HASH_CHUNK_HEADER_SIZE while the parallel
path targets allocations of exactly HASH_CHUNK_SIZE,

That is intentional: dsa.c sucks at allocating 32KB + a tiny bit
because it works in 4KB pages for large allocations, so I wanted to
make HASH_CHUNK_SIZE the total size that arrives into dsa_allocate().
The non-parallel path has similar problems on some libc
implementations, as we discussed over here:

/messages/by-id/29770.1511495642@sss.pgh.pa.us

and why there's such
inconsistency in whether tuples of exactly HASH_CHUNK_THRESHOLD bytes
are treated as "big" or not.

Right, that's inconsistent. Fixed in the attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-phj-fencepost.patch (application/octet-stream)
From 3228c906ce1aa3d8de2b2f3c0c10e6e509510e4a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 3 Jan 2018 16:01:40 +1300
Subject: [PATCH] Fix some fencepost errors in Parallel Hash memory allocation.

For consistency, ExecParallelHashTupleAlloc() shouldn't treat tuples of size
HASH_CHUNK_THRESHOLD as large tuples (only tuples larger than that).

ExecParallelHashTuplePrealloc() should check the whole size being preallocated
(ie typically a chunk), not the size requested by the caller, otherwise memory
budget exhaustion won't be detected until the next call.

Thomas Munro, based on code review from Tom Lane.
---
 src/backend/executor/nodeHash.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 52f5c0c26e0..7c94938ff8c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -2740,7 +2740,7 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 	 */
 	chunk = hashtable->current_chunk;
 	if (chunk != NULL &&
-		size < HASH_CHUNK_THRESHOLD &&
+		size <= HASH_CHUNK_THRESHOLD &&
 		chunk->maxlen - chunk->used >= size)
 	{
 
@@ -3280,7 +3280,8 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 
 	if (pstate->growth != PHJ_GROWTH_DISABLED &&
 		batch->at_least_one_chunk &&
-		(batch->shared->estimated_size + size > pstate->space_allowed))
+		(batch->shared->estimated_size + want + HASH_CHUNK_HEADER_SIZE
+		 > pstate->space_allowed))
 	{
 		/*
 		 * We have determined that this batch would exceed the space budget if
-- 
2.15.0

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#28)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Wed, Jan 3, 2018 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm. That could do it, except it doesn't really account for the observed
result that slower single-processor machines seem more prone to the
bug. Surely they should be less likely to get multiple workers activated.

It plans for 8 batches, and then usually but not always goes to 16 at
execution time depending on timing. It doesn't happen for me with
128kB (the setting used in the regression test), but I think the
affected BF machines are all 32 bit systems that have MAXIMUM_ALIGNOF
8 and therefore use a bit more space, whereas my machines have
MAXIMUM_ALIGNOF 4 in a 32 bit build, so that would explain the
different location of this unstable border.

Ah, that might do it. You're right that gaur/pademelon are in that class.

We could probably fix the
failures by simply increasing work_mem out of that zone, but I'm
hoping to fix the problem in a more principled way. More soon.

Meh. I think we're going to end up having to pick a modified test
case that's further away from the chunk size threshold. I do not
think it is possible to predict this runtime behavior exactly at
plan time, nor am I convinced that expending planner cycles on a
somewhat closer estimate is a useful use of planning time.

I'm also wondering why the non-parallel path seems to prefer to allocate
in units of HASH_CHUNK_SIZE + HASH_CHUNK_HEADER_SIZE while the parallel
path targets allocations of exactly HASH_CHUNK_SIZE,

That is intentional: dsa.c sucks at allocating 32KB + a tiny bit
because it works in 4KB pages for large allocations, so I wanted to
make HASH_CHUNK_SIZE the total size that arrives into dsa_allocate().
The non-parallel path has similar problems on some libc
implementations, as we discussed over here:
/messages/by-id/29770.1511495642@sss.pgh.pa.us

Oh, right, I'd forgotten about that discussion. I think it would be
good to adjust hashjoin so that both paths are targeting 32KB total;
but you are right that there's not a lot of point in fooling with the
non-parallel path until we add some way of accounting for aset.c's
overhead too. We can leave that for later.

BTW, I'm seeing a few things that look bug-like about
ExecParallelHashTuplePrealloc, ...

Right. Fixed in the attached.

Pushed, plus an additional Assert to clarify that we're expecting
ExecParallelHashTuplePrealloc's caller to have already maxalign'ed
the request size.

regards, tom lane

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#29)
Re: pgsql: Add parallel-aware hash joins.

So this patch has been in place for two weeks, the buildfarm has still
got the measles, and we're entering the January commitfest with a lot
of other work to get done. I realize that the two weeks were mostly
holiday time, but it's time to have some urgency about clearing the
buildfarm failures.

As best I understand, the remaining failures simply arise from
an unstable regression test case and aren't teaching us anything
of value. I propose that for the moment, we simply remove the
failing query from join.sql. We can put it back when there's
reason to think it will pass everywhere.

regards, tom lane

#31Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#30)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2018-01-03 20:17:04 -0500, Tom Lane wrote:

So this patch has been in place for two weeks, the buildfarm has still
got the measles, and we're entering the January commitfest with a lot
of other work to get done. I realize that the two weeks were mostly
holiday time, but it's time to have some urgency about clearing the
buildfarm failures.

Agreed.

As best I understand, the remaining failures simply arise from
an unstable regression test case and aren't teaching us anything
of value.

I think that's the correct interpretation.

I propose that for the moment, we simply remove the
failing query from join.sql. We can put it back when there's
reason to think it will pass everywhere.

Alternatively we could temporarily push an alternative output. It
doesn't seem inconceivable we missed an actual bug due to the
noise. Don't have a strong opinion on which of the two to choose.

Unless somebody does something first, I'll disable the test tomorrow
Pacific morning. But perhaps Thomas can come up with an approach till then
that seems likely to pass on the different architectures. Seems quite
possible to just adjust the thresholds carefully?

Greetings,

Andres Freund

#32Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#30)
1 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Jan 4, 2018 at 2:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So this patch has been in place for two weeks, the buildfarm has still
got the measles, and we're entering the January commitfest with a lot
of other work to get done. I realize that the two weeks were mostly
holiday time, but it's time to have some urgency about clearing the
buildfarm failures.

As best I understand, the remaining failures simply arise from
an unstable regression test case and aren't teaching us anything
of value. I propose that for the moment, we simply remove the
failing query from join.sql. We can put it back when there's
reason to think it will pass everywhere.

I spent a lot of time trying and failing to get the world's slowest 32
bit powerpc emulation to reproduce this. Bleugh. Before we rip that
test out, would you mind checking if this passes repeatedly on gaur or
pademelon?

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

tweak-phj-test.patch (application/octet-stream)
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index a7cfdf1f443..02e7d56e550 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -6073,7 +6073,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '192kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join simple s using (id);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index a6a452f9609..dd62c38c15e 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2116,7 +2116,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '192kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join simple s using (id);
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#32)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I spent a lot of time trying and failing to get the world's slowest 32
bit powerpc emulation to reproduce this. Bleugh. Before we rip that
test out, would you mind checking if this passes repeatedly on gaur or
pademelon?

Will do, but that machine is none too fast itself ... it'll be a few
hours before I can confirm anything.

regards, tom lane

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#32)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

I spent a lot of time trying and failing to get the world's slowest 32
bit powerpc emulation to reproduce this. Bleugh. Before we rip that
test out, would you mind checking if this passes repeatedly on gaur or
pademelon?

I ran a couple dozen test cycles on gaur without a failure. That's
not enough to really prove anything, but it's more successes than I was
getting before. I pushed the patch so we can see what the rest of the
buildfarm thinks.

regards, tom lane

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#34)
Re: pgsql: Add parallel-aware hash joins.

I wrote:

I ran a couple dozen test cycles on gaur without a failure. That's
not enough to really prove anything, but it's more successes than I was
getting before. I pushed the patch so we can see what the rest of the
buildfarm thinks.

The early returns indicate that that problem is fixed; but now that the
noise level is down, it's possible to see that brolga is showing an actual
crash in the PHJ test, perhaps one time in four. So we're not out of
the woods yet. It seems to consistently look like this:

2017-12-21 17:34:52.092 EST [2252:4] LOG: background worker "parallel worker" (PID 3584) was terminated by signal 11
2017-12-21 17:34:52.092 EST [2252:5] DETAIL: Failed process was running: select count(*) from foo
left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
on foo.id < ss.id + 1 and foo.id > ss.id - 1;
2017-12-21 17:34:52.092 EST [2252:6] LOG: terminating any other active server processes

Also, what the devil is happening on skink?

regards, tom lane

#36Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#35)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Jan 4, 2018 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, what the devil is happening on skink?

I looked at the server log for the first of the two skink failures.
The key lines seem to be:

2018-01-04 07:45:36.764 UTC [5a4ddb98.5a97:154] LOG: statement: DROP
SCHEMA testschema CASCADE;
2018-01-04 07:45:36.922 UTC [5a4ddb98.5a97:155] LOG: statement: DROP
ROLE regress_tablespace_user1;
2018-01-04 07:45:36.941 UTC [5a4ddb98.5a97:156] LOG: statement: DROP
ROLE regress_tablespace_user2;
2018-01-04 07:45:36.963 UTC [5a4ddb98.5a97:157] LOG: disconnection:
session time: 0:00:08.472 user=andres database=regression host=[local]
...lines from other backends indicating new connections...
2018-01-04 07:45:37.289 UTC [5a4ddb92.2b65:4] LOG: server process
(PID 23191) exited with exit code 128
2018-01-04 07:45:37.290 UTC [5a4ddb92.2b65:5] LOG: terminating any
other active server processes

So, skink is apparently dying during shutdown of a user-connected
backend, and specifically the one that executed the 'tablespace' test.
That test itself passed, but then afterward the backend fell over and
died horribly. Interestingly, I see multiple other messages in the
logs indicating similar failures:

2018-01-04 07:45:37.821 UTC [5a4ddb92.2b65:6] LOG: statistics
collector process (PID 20842) exited with exit code 128
2018-01-04 07:46:08.113 UTC [5a4ddb92.2b65:9] LOG: server process
(PID 29128) exited with exit code 128
2018-01-04 07:46:08.293 UTC [5a4ddb92.2b65:11] LOG: statistics
collector process (PID 29059) exited with exit code 128
2018-01-04 07:46:26.384 UTC [5a4ddb92.2b65:14] LOG: server process
(PID 13787) exited with exit code 128
2018-01-04 07:46:27.122 UTC [5a4ddb92.2b65:16] LOG: statistics
collector process (PID 13776) exited with exit code 128

This is a valgrind animal. I don't know if that might be related somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#36)
Re: pgsql: Add parallel-aware hash joins.

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 4, 2018 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, what the devil is happening on skink?

So, skink is apparently dying during shutdown of a user-connected
backend, and specifically the one that executed the 'tablespace' test.

Well, yeah, valgrind is burping: the postmaster log is full of

==10544== VALGRINDERROR-BEGIN
==10544== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
==10544== at 0x7011490: epoll_pwait (epoll_pwait.c:42)
==10544== by 0x4BF40B: WaitEventSetWaitBlock (latch.c:1048)
==10544== by 0x4BF40B: WaitEventSetWait (latch.c:1000)
==10544== by 0x3C0B3B: secure_read (be-secure.c:166)
==10544== by 0x3CCD9E: pq_recvbuf (pqcomm.c:963)
==10544== by 0x3CDA07: pq_getbyte (pqcomm.c:1006)
==10544== by 0x4E2A2D: SocketBackend (postgres.c:339)
==10544== by 0x4E444E: ReadCommand (postgres.c:512)
==10544== by 0x4E7588: PostgresMain (postgres.c:4085)
==10544== by 0x4641D0: BackendRun (postmaster.c:4412)
==10544== by 0x467308: BackendStartup (postmaster.c:4084)
==10544== by 0x4675F7: ServerLoop (postmaster.c:1757)
==10544== by 0x4689D4: PostmasterMain (postmaster.c:1365)
==10544== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==10544==
==10544== VALGRINDERROR-END

But (a) this is happening in multiple branches, and (b) we've not
changed anything near that code in a while. I think something broke
in valgrind itself.

regards, tom lane

#38Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#37)
Re: pgsql: Add parallel-aware hash joins.

On 2018-01-04 12:11:37 -0500, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 4, 2018 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, what the devil is happening on skink?

So, skink is apparently dying during shutdown of a user-connected
backend, and specifically the one that executed the 'tablespace' test.

Well, yeah, valgrind is burping: the postmaster log is full of

==10544== VALGRINDERROR-BEGIN
==10544== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
==10544== at 0x7011490: epoll_pwait (epoll_pwait.c:42)
==10544== by 0x4BF40B: WaitEventSetWaitBlock (latch.c:1048)
==10544== by 0x4BF40B: WaitEventSetWait (latch.c:1000)
==10544== by 0x3C0B3B: secure_read (be-secure.c:166)
==10544== by 0x3CCD9E: pq_recvbuf (pqcomm.c:963)
==10544== by 0x3CDA07: pq_getbyte (pqcomm.c:1006)
==10544== by 0x4E2A2D: SocketBackend (postgres.c:339)
==10544== by 0x4E444E: ReadCommand (postgres.c:512)
==10544== by 0x4E7588: PostgresMain (postgres.c:4085)
==10544== by 0x4641D0: BackendRun (postmaster.c:4412)
==10544== by 0x467308: BackendStartup (postmaster.c:4084)
==10544== by 0x4675F7: ServerLoop (postmaster.c:1757)
==10544== by 0x4689D4: PostmasterMain (postmaster.c:1365)
==10544== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==10544==
==10544== VALGRINDERROR-END

But (a) this is happening in multiple branches, and (b) we've not
changed anything near that code in a while. I think something broke
in valgrind itself.

Some packages on skink have been upgraded. It appears that there either
was a libc or valgrind change that made valgrind not recognize that a
pointer of 0 might not point anywhere :(

Let me check whether valgrind accepts multiple suppression files, in
which case I could add a suppression for this error to all
branches. Will also check whether I can reproduce locally.

Greetings,

Andres Freund

#39Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#38)
Re: pgsql: Add parallel-aware hash joins.

On 2018-01-04 11:20:33 -0800, Andres Freund wrote:

On 2018-01-04 12:11:37 -0500, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 4, 2018 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, what the devil is happening on skink?

So, skink is apparently dying during shutdown of a user-connected
backend, and specifically the one that executed the 'tablespace' test.

Well, yeah, valgrind is burping: the postmaster log is full of

==10544== VALGRINDERROR-BEGIN
==10544== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
==10544== at 0x7011490: epoll_pwait (epoll_pwait.c:42)
==10544== by 0x4BF40B: WaitEventSetWaitBlock (latch.c:1048)
==10544== by 0x4BF40B: WaitEventSetWait (latch.c:1000)
==10544== by 0x3C0B3B: secure_read (be-secure.c:166)
==10544== by 0x3CCD9E: pq_recvbuf (pqcomm.c:963)
==10544== by 0x3CDA07: pq_getbyte (pqcomm.c:1006)
==10544== by 0x4E2A2D: SocketBackend (postgres.c:339)
==10544== by 0x4E444E: ReadCommand (postgres.c:512)
==10544== by 0x4E7588: PostgresMain (postgres.c:4085)
==10544== by 0x4641D0: BackendRun (postmaster.c:4412)
==10544== by 0x467308: BackendStartup (postmaster.c:4084)
==10544== by 0x4675F7: ServerLoop (postmaster.c:1757)
==10544== by 0x4689D4: PostmasterMain (postmaster.c:1365)
==10544== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==10544==
==10544== VALGRINDERROR-END

But (a) this is happening in multiple branches, and (b) we've not
changed anything near that code in a while. I think something broke
in valgrind itself.

Some packages on skink have been upgraded. It appears that there either
was a libc or valgrind change that made valgrind not recognize that a
pointer of 0 might not point anywhere :(

It looks like recent glibc implements epoll_wait() as epoll_pwait() with
a NULL sigmask argument, and valgrind isn't happy with it.

Let me check whether valgrind accepts multiple suppression files, in
which case I could add a suppression for this error to all
branches.

Done that. I've added a valgrind-global.supp that has:

{
null-sigmask-to-pwait-means-no-sigmask
Memcheck:Param
epoll_pwait(sigmask)
fun:epoll_pwait
fun:WaitEventSetWaitBlock
}

Will also check whether I can reproduce locally.

Locally the above suppression allowed me to continue a bit further. But
I promptly hit other new blurbs:

assert build:
2018-01-04 12:03:45.927 PST [5716][3/189] STATEMENT: select format('%1$s %4$s', 1, 2, 3);
==5718== Invalid write of size 8
==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282)
==5718== by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602)
==5718== by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197)
==5718== by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230)
==5718== by 0xE6B2C04: exec_stmt (pl_exec.c:1632)
==5718== by 0xE6B2956: exec_stmts (pl_exec.c:1535)
==5718== by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473)
==5718== by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474)
==5718== by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330)
==5718== by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327)
==5718== by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234)
==5718== by 0x5F92EB: standard_ProcessUtility (utility.c:532)
==5718== by 0x5F8E26: ProcessUtility (utility.c:357)
==5718== by 0x5F7D9F: PortalRunUtility (pquery.c:1178)
==5718== by 0x5F7FC0: PortalRunMulti (pquery.c:1324)
==5718== by 0x5F74DE: PortalRun (pquery.c:799)
==5718== by 0x5F10BE: exec_simple_query (postgres.c:1120)
==5718== by 0x5F5640: PostgresMain (postgres.c:4143)
==5718== Address 0x20 is not stack'd, malloc'd or (recently) free'd
==5718==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:ExecInterpExpr
fun:ExecInterpExprStillValid
fun:ExecEvalExpr
fun:exec_eval_simple_expr
fun:exec_eval_expr
fun:exec_stmt_raise
fun:exec_stmt
fun:exec_stmts
fun:exec_stmt_block
fun:plpgsql_exec_function
fun:plpgsql_inline_handler
fun:OidFunctionCall1Coll
fun:ExecuteDoStmt
fun:standard_ProcessUtility
fun:ProcessUtility
fun:PortalRunUtility
fun:PortalRunMulti
fun:PortalRun
fun:exec_simple_query
fun:PostgresMain
}
==5718==
==5718== Process terminating with default action of signal 11 (SIGSEGV)
==5718== Access not within mapped region at address 0x20
==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282)
==5718== by 0xE6BA6EA: exec_eval_simple_expr (pl_exec.c:5602)
==5718== by 0xE6B9DBB: exec_eval_expr (pl_exec.c:5197)
==5718== by 0xE6B5A4B: exec_stmt_raise (pl_exec.c:3230)
==5718== by 0xE6B2C04: exec_stmt (pl_exec.c:1632)
==5718== by 0xE6B2956: exec_stmts (pl_exec.c:1535)
==5718== by 0xE6B27FC: exec_stmt_block (pl_exec.c:1473)
==5718== by 0xE6B083E: plpgsql_exec_function (pl_exec.c:474)
==5718== by 0xE6AB13C: plpgsql_inline_handler (pl_handler.c:330)
==5718== by 0x76C2B9: OidFunctionCall1Coll (fmgr.c:1327)
==5718== by 0x3A28CA: ExecuteDoStmt (functioncmds.c:2234)
==5718== by 0x5F92EB: standard_ProcessUtility (utility.c:532)
==5718== by 0x5F8E26: ProcessUtility (utility.c:357)
==5718== by 0x5F7D9F: PortalRunUtility (pquery.c:1178)
==5718== by 0x5F7FC0: PortalRunMulti (pquery.c:1324)
==5718== by 0x5F74DE: PortalRun (pquery.c:799)
==5718== by 0x5F10BE: exec_simple_query (postgres.c:1120)
==5718== by 0x5F5640: PostgresMain (postgres.c:4143)
==5718== If you believe this happened as a result of a stack
==5718== overflow in your program's main thread (unlikely but
==5718== possible), you can try to increase the size of the
==5718== main thread stack using the --main-stacksize= flag.
==5718== The main thread stack size used in this run was 8388608.

This looks like it might be related to recent plpgsql changes. Looks
like some pointer in:
/* call input function (similar to InputFunctionCall) */
if (!op->d.iocoerce.finfo_in->fn_strict || str != NULL)
is NULL.

==4286== Conditional jump or move depends on uninitialised value(s)
==4286== at 0x6BA8A17: __wcsnlen_avx2 (strlen-avx2.S:261)
==4286== by 0x6AF2FF1: wcsrtombs (wcsrtombs.c:104)
==4286== by 0x6A88A40: wcstombs (wcstombs.c:34)
==4286== by 0x514772: lowerstr_with_len.part.1 (ts_locale.c:284)
==4286== by 0x523A63: readstoplist (ts_utils.c:119)
==4286== by 0xEAB2839: dsnowball_init (dict_snowball.c:201)
==4286== by 0x684975: OidFunctionCall1Coll (fmgr.c:1327)
==4286== by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337)
==4286== by 0x514E02: LexizeExec (ts_parse.c:203)
==4286== by 0x5153ED: parsetext (ts_parse.c:405)
==4286== by 0x521DD2: to_tsvector_byid (to_tsany.c:249)
==4286== by 0x380261: ExecInterpExpr (execExprInterp.c:664)
==4286== by 0x456069: ExecEvalExprSwitchContext (executor.h:301)
==4286== by 0x456069: evaluate_expr (clauses.c:4873)
==4286== by 0x4588B5: evaluate_function (clauses.c:4424)
==4286== by 0x4588B5: simplify_function (clauses.c:4064)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903)
==4286== by 0x458E43: simplify_function (clauses.c:4055)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614)
==4286== by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671)
==4286== Uninitialised value was created by a heap allocation
==4286== at 0x4C2BB76: malloc (vg_replace_malloc.c:299)
==4286== by 0x6A9667: AllocSetAlloc (aset.c:945)
==4286== by 0x6B5AC7: palloc (mcxt.c:848)
==4286== by 0x5238EA: get_tsearch_config_filename (ts_utils.c:55)
==4286== by 0x5239AF: readstoplist (ts_utils.c:75)
==4286== by 0xEAB2839: dsnowball_init (dict_snowball.c:201)
==4286== by 0x684975: OidFunctionCall1Coll (fmgr.c:1327)
==4286== by 0x679277: lookup_ts_dictionary_cache (ts_cache.c:337)
==4286== by 0x514E02: LexizeExec (ts_parse.c:203)
==4286== by 0x5153ED: parsetext (ts_parse.c:405)
==4286== by 0x521DD2: to_tsvector_byid (to_tsany.c:249)
==4286== by 0x380261: ExecInterpExpr (execExprInterp.c:664)
==4286== by 0x456069: ExecEvalExprSwitchContext (executor.h:301)
==4286== by 0x456069: evaluate_expr (clauses.c:4873)
==4286== by 0x4588B5: evaluate_function (clauses.c:4424)
==4286== by 0x4588B5: simplify_function (clauses.c:4064)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D615A: expression_tree_mutator (nodeFuncs.c:2903)
==4286== by 0x458E43: simplify_function (clauses.c:4055)
==4286== by 0x457519: eval_const_expressions_mutator (clauses.c:2676)
==4286== by 0x3D5C02: expression_tree_mutator (nodeFuncs.c:2614)
==4286== by 0x4572C1: eval_const_expressions_mutator (clauses.c:3671)
==4286==
{
<insert_a_suppression_name_here>
Memcheck:Cond
fun:__wcsnlen_avx2
fun:wcsrtombs
fun:wcstombs
fun:lowerstr_with_len.part.1
fun:readstoplist
fun:dsnowball_init
fun:OidFunctionCall1Coll
fun:lookup_ts_dictionary_cache
fun:LexizeExec
fun:parsetext
fun:to_tsvector_byid
fun:ExecInterpExpr
fun:ExecEvalExprSwitchContext
fun:evaluate_expr
fun:evaluate_function
fun:simplify_function
fun:eval_const_expressions_mutator
fun:expression_tree_mutator
fun:simplify_function
fun:eval_const_expressions_mutator
fun:expression_tree_mutator
fun:eval_const_expressions_mutator
}

I don't recall any recent changes in the area, so this might be a glibc
internal issue.

Greetings,

Andres Freund

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#39)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund <andres@anarazel.de> writes:

On 2018-01-04 11:20:33 -0800, Andres Freund wrote:

Some packages on skink have been upgraded. It appears that there either
was a libc or valgrind change that made valgrind not recognize that a
pointer of 0 might not point anywhere :(

==5718== Invalid write of size 8
==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282)

Are those line numbers supposed to match current HEAD? 1117 does not
contain any write AFAICS:

if (!op->d.iocoerce.finfo_in->fn_strict || str != NULL)

==5718== Process terminating with default action of signal 11 (SIGSEGV)
==5718== Access not within mapped region at address 0x20
==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)

This would be consistent with op->d.iocoerce.finfo_in being NULL, I think,
but there is no way that execExpr.c could have created an EEOP_IOCOERCE
step without filling in the finfo_in pointer.

I think the correct conclusion is that this version of valgrind is
buggy as hell. I wouldn't clutter our code with suppressions trying
to make it sort-of work.

regards, tom lane

#41Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#40)
Re: pgsql: Add parallel-aware hash joins.

On 2018-01-04 15:16:15 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2018-01-04 11:20:33 -0800, Andres Freund wrote:

Some packages on skink have been upgraded. It appears that there either
was a libc or valgrind change that made valgrind not recognize that a
pointer of 0 might not point anywhere :(

==5718== Invalid write of size 8
==5718== at 0x40E3DD: ExecInterpExpr (execExprInterp.c:1117)
==5718== by 0x40F5E9: ExecInterpExprStillValid (execExprInterp.c:1788)
==5718== by 0xE6B0409: ExecEvalExpr (executor.h:282)

Are those line numbers supposed to match current HEAD?

Well, 3e68686e2c55799234ecd020bd1621f913d65475, but that's pretty
close.

But I think it was just the result of a somehow not entirely clean
build, with plpgsql not being rebuilt when necessary. Not sure why/how,
but at least that means the epoll_pwait() suppression will likely revive
skink to some degree.

I made sure that the valgrind version isn't affected (it's from mid last
year). After a forced rebuild it survives a bit longer, but it also encounters

==13750== Conditional jump or move depends on uninitialised value(s)
==13750== at 0x6BA888C: __wcsnlen_avx2 (strlen-avx2.S:103)
==13750== by 0x6AF2FF1: wcsrtombs (wcsrtombs.c:104)
==13750== by 0x6A88A40: wcstombs (wcstombs.c:34)
==13750== by 0x6BA76C: wchar2char (pg_locale.c:1641)
==13750== by 0x653A7C: str_tolower (formatting.c:1591)

and various other related issues. I'm not sure if there's a legitimate
formatting.c issue or whether it's glibc changes around string handling
triggering spurious blurbs.

Greetings,

Andres Freund

#42Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#35)
Re: pgsql: Add parallel-aware hash joins.

On Fri, Jan 5, 2018 at 5:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The early returns indicate that that problem is fixed;

Thanks for your help and patience with that. I've made a list over
here so we don't lose track of the various things that should be
improved in this area, and will start a new thread when I have patches
to propose: https://wiki.postgresql.org/wiki/Parallel_Hash

but now that the
noise level is down, it's possible to see that brolga is showing an actual
crash in the PHJ test, perhaps one time in four. So we're not out of
the woods yet. It seems to consistently look like this:

2017-12-21 17:34:52.092 EST [2252:4] LOG: background worker "parallel worker" (PID 3584) was terminated by signal 11
2017-12-21 17:34:52.092 EST [2252:5] DETAIL: Failed process was running: select count(*) from foo
left join (select b1.id, b1.t from bar b1 join bar b2 using (id)) ss
on foo.id < ss.id + 1 and foo.id > ss.id - 1;
2017-12-21 17:34:52.092 EST [2252:6] LOG: terminating any other active server processes

That is a test of a parallel-aware hash join with a rescan (ie workers
get restarted repeatedly by the gather node reusing the DSM; maybe I
misunderstood some detail of the protocol for that). I'll go and
review that code and try to reproduce the failure. On the off-chance,
Andrew, is there any chance you have a core dump you could pull a
backtrace out of, on brolga?

--
Thomas Munro
http://www.enterprisedb.com

#43Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#16)
2 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

On Thu, Dec 28, 2017 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aside from the instability problems, I'm pretty unhappy about how much
the PHJ patch has added to the runtime of "make check". I do not think
any one feature can justify adding 20% to that. Can't you cut down the
amount of data processed by these new test cases?

Isn't that mostly because of the CV livelock problem?

Hm, maybe. I quoted the 20% figure on the basis of longfin's reports,
not prairiedog's ... but it might be seeing some of the livelock effect
too.

So without the effects of that bug it's only taking 2.4% longer than
commit fa330f9a. Is that acceptable for a feature of this size and
complexity? I will also look into making the data sets smaller.

That sounds better, but it's still worth asking whether the tests
could be quicker.

Here is a patch that halves the size of the test tables used. I don't
want them to be too small because I want some real parallel
processing at least sometimes. On my slowest system with assertions
enabled this brings "time make check" from ~37.5s to ~36.9s. I'm a
bit worried about choosing parameters that lead to instability across
the buildfarm, until I make the size estimation code a bit smarter[1],
so I tested a few variations that affect the alignment and size of
things (-m32, -malign-double, --disable-atomics, --disable-spinlocks)
and didn't see failures. Does this pass repeatedly on gaur?

Since you mentioned longfin, I checked some historical "check"
runtimes and they seemed pretty erratic so I decided to bust out
gnuplot to try to see what's happening. See attached, which goes back
as far as I could web-scrape from the buildfarm. Three interesting
commits are shown with vertical lines, in this order from left to
right:

Commit fa330f9a "Add some regression tests that exercise hash join
code.", Nov 29.
Commit 18042840 "Add parallel-aware hash joins.", Dec 20
Commit aced5a92 "Rewrite ConditionVariableBroadcast() to avoid
live-lock.", Jan 5.

For comparison I also plotted data for two other machines whose times
are in a similar range. It looks to me like longfin's average and
variation increased half way between fa330f9a and 18042840, somewhere
near Dec 7 to 9, when we went from ~40s +/- 1 to ~50s with several
seconds' variation. Was there some other environmental change then on
that machine? Do you see any commit that could explain that -- maybe
0a3edbb33? Then maybe it also suffered a bit from livelock woes from
18042840 until aced5a92? It's hard to say and I'm not really sure what
to make of this, but it seems clear that you could draw the wrong
conclusion if you just picked a couple of random data points from that
animal, and that those other animals aren't affected too much by any
of these commits.

[1]: https://wiki.postgresql.org/wiki/Parallel_Hash

Attachments:

check-times.png (image/png)
join-test-shrink.patch (application/octet-stream)
From d5871edc4daaa5218aa627059afaa65ba39ea767 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 15 Jan 2018 15:34:35 +1300
Subject: [PATCH] Reduce the size of the tables used in the hash join
 regression tests.

Thomas Munro, per complaint from Tom Lane.
---
 src/test/regress/expected/join.out | 60 +++++++++++++++++++-------------------
 src/test/regress/sql/join.sql      | 28 +++++++++---------
 2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 02e7d56e550..5f705b017c0 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5860,27 +5860,27 @@ $$;
 -- Make a simple relation with well distributed keys and correctly
 -- estimated size.
 create table simple as
-  select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+  select generate_series(1, 10000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
 alter table simple set (parallel_workers = 2);
 analyze simple;
 -- Make a relation whose size we will under-estimate.  We want stats
--- to say 1000 rows, but actually there are 20,000 rows.
+-- to say 1000 rows, but actually there are 10,000 rows.
 create table bigger_than_it_looks as
-  select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+  select generate_series(1, 10000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
 alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
 alter table bigger_than_it_looks set (parallel_workers = 2);
 analyze bigger_than_it_looks;
 update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
 -- Make a relation whose size we underestimate and that also has a
 -- kind of skew that breaks our batching scheme.  We want stats to say
--- 2 rows, but actually there are 20,000 rows with the same key.
+-- 2 rows, but actually there are 10,000 rows with the same key.
 create table extremely_skewed (id int, t text);
 alter table extremely_skewed set (autovacuum_enabled = 'false');
 alter table extremely_skewed set (parallel_workers = 2);
 analyze extremely_skewed;
 insert into extremely_skewed
   select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
-  from generate_series(1, 20000);
+  from generate_series(1, 10000);
 update pg_class
   set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
   where relname = 'extremely_skewed';
@@ -5909,7 +5909,7 @@ explain (costs off)
 select count(*) from simple r join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -5946,7 +5946,7 @@ explain (costs off)
 select count(*) from simple r join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -5983,7 +5983,7 @@ explain (costs off)
 select count(*) from simple r join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -6003,7 +6003,7 @@ rollback to settings;
 -- non-parallel
 savepoint settings;
 set local max_parallel_workers_per_gather = 0;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 explain (costs off)
   select count(*) from simple r join simple s using (id);
                QUERY PLAN               
@@ -6019,7 +6019,7 @@ explain (costs off)
 select count(*) from simple r join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -6036,7 +6036,7 @@ rollback to settings;
 -- parallel with parallel-oblivious hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = off;
 explain (costs off)
   select count(*) from simple r join simple s using (id);
@@ -6056,7 +6056,7 @@ explain (costs off)
 select count(*) from simple r join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -6073,7 +6073,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '192kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join simple s using (id);
@@ -6093,7 +6093,7 @@ explain (costs off)
 select count(*) from simple r join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -6114,7 +6114,7 @@ rollback to settings;
 -- non-parallel
 savepoint settings;
 set local max_parallel_workers_per_gather = 0;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 explain (costs off)
   select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
                       QUERY PLAN                      
@@ -6130,7 +6130,7 @@ explain (costs off)
 select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -6147,7 +6147,7 @@ rollback to settings;
 -- parallel with parallel-oblivious hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = off;
 explain (costs off)
   select count(*) from simple r join bigger_than_it_looks s using (id);
@@ -6167,7 +6167,7 @@ explain (costs off)
 select count(*) from simple r join bigger_than_it_looks s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -6184,7 +6184,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 1;
-set local work_mem = '192kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join bigger_than_it_looks s using (id);
@@ -6204,7 +6204,7 @@ explain (costs off)
 select count(*) from simple r join bigger_than_it_looks s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select original > 1 as initially_multibatch, final > original as increased_batches
@@ -6226,7 +6226,7 @@ rollback to settings;
 -- non-parallel
 savepoint settings;
 set local max_parallel_workers_per_gather = 0;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 explain (costs off)
   select count(*) from simple r join extremely_skewed s using (id);
                     QUERY PLAN                    
@@ -6242,7 +6242,7 @@ explain (costs off)
 select count(*) from simple r join extremely_skewed s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select * from hash_join_batches(
@@ -6258,7 +6258,7 @@ rollback to settings;
 -- parallel with parallel-oblivious hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = off;
 explain (costs off)
   select count(*) from simple r join extremely_skewed s using (id);
@@ -6277,7 +6277,7 @@ explain (costs off)
 select count(*) from simple r join extremely_skewed s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select * from hash_join_batches(
@@ -6293,7 +6293,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 1;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join extremely_skewed s using (id);
@@ -6313,7 +6313,7 @@ explain (costs off)
 select count(*) from simple r join extremely_skewed s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 select * from hash_join_batches(
@@ -6571,7 +6571,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 rollback to settings;
@@ -6593,7 +6593,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s using (id);
  count 
 -------
- 20000
+ 10000
 (1 row)
 
 rollback to settings;
@@ -6616,7 +6616,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count 
 -------
- 40000
+ 20000
 (1 row)
 
 rollback to settings;
@@ -6638,7 +6638,7 @@ explain (costs off)
 select  count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
  count 
 -------
- 40000
+ 20000
 (1 row)
 
 rollback to settings;
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index dd62c38c15e..149ddf4c23d 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2001,14 +2001,14 @@ $$;
 -- Make a simple relation with well distributed keys and correctly
 -- estimated size.
 create table simple as
-  select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+  select generate_series(1, 10000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
 alter table simple set (parallel_workers = 2);
 analyze simple;
 
 -- Make a relation whose size we will under-estimate.  We want stats
--- to say 1000 rows, but actually there are 20,000 rows.
+-- to say 1000 rows, but actually there are 10,000 rows.
 create table bigger_than_it_looks as
-  select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+  select generate_series(1, 10000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
 alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
 alter table bigger_than_it_looks set (parallel_workers = 2);
 analyze bigger_than_it_looks;
@@ -2016,14 +2016,14 @@ update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
 
 -- Make a relation whose size we underestimate and that also has a
 -- kind of skew that breaks our batching scheme.  We want stats to say
--- 2 rows, but actually there are 20,000 rows with the same key.
+-- 2 rows, but actually there are 10,000 rows with the same key.
 create table extremely_skewed (id int, t text);
 alter table extremely_skewed set (autovacuum_enabled = 'false');
 alter table extremely_skewed set (parallel_workers = 2);
 analyze extremely_skewed;
 insert into extremely_skewed
   select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
-  from generate_series(1, 20000);
+  from generate_series(1, 10000);
 update pg_class
   set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
   where relname = 'extremely_skewed';
@@ -2087,7 +2087,7 @@ rollback to settings;
 -- non-parallel
 savepoint settings;
 set local max_parallel_workers_per_gather = 0;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 explain (costs off)
   select count(*) from simple r join simple s using (id);
 select count(*) from simple r join simple s using (id);
@@ -2101,7 +2101,7 @@ rollback to settings;
 -- parallel with parallel-oblivious hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = off;
 explain (costs off)
   select count(*) from simple r join simple s using (id);
@@ -2116,7 +2116,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '192kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join simple s using (id);
@@ -2136,7 +2136,7 @@ rollback to settings;
 -- non-parallel
 savepoint settings;
 set local max_parallel_workers_per_gather = 0;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 explain (costs off)
   select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
 select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
@@ -2150,7 +2150,7 @@ rollback to settings;
 -- parallel with parallel-oblivious hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = off;
 explain (costs off)
   select count(*) from simple r join bigger_than_it_looks s using (id);
@@ -2165,7 +2165,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 1;
-set local work_mem = '192kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join bigger_than_it_looks s using (id);
@@ -2186,7 +2186,7 @@ rollback to settings;
 -- non-parallel
 savepoint settings;
 set local max_parallel_workers_per_gather = 0;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 explain (costs off)
   select count(*) from simple r join extremely_skewed s using (id);
 select count(*) from simple r join extremely_skewed s using (id);
@@ -2199,7 +2199,7 @@ rollback to settings;
 -- parallel with parallel-oblivious hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 2;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = off;
 explain (costs off)
   select count(*) from simple r join extremely_skewed s using (id);
@@ -2213,7 +2213,7 @@ rollback to settings;
 -- parallel with parallel-aware hash join
 savepoint settings;
 set local max_parallel_workers_per_gather = 1;
-set local work_mem = '128kB';
+set local work_mem = '64kB';
 set local enable_parallel_hash = on;
 explain (costs off)
   select count(*) from simple r join extremely_skewed s using (id);
-- 
2.15.1

#44Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#43)
Re: pgsql: Add parallel-aware hash joins.

On Mon, Jan 22, 2018 at 11:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

It looks to me like longfin's average and
variation increased half way between fa330f9a and 18042840, somewhere
near Dec 7 to 9, when we went from ~40s +/- 1 to ~50s with several
seconds' variation. Was there some other environmental change then on
that machine?

Ahh. That corresponds to the release of this patch:

https://support.apple.com/en-nz/HT208331

And indeed I see from the logs that the kernel changed:

uname -v = Darwin Kernel Version 17.2.0: Fri Sep 29 18:27:05 PDT 2017;
root:xnu-4570.20.62~3/RELEASE_X86_64
uname -v = Darwin Kernel Version 17.3.0: Thu Nov 9 18:09:22 PST 2017;
root:xnu-4570.31.3~1/RELEASE_X86_64

--
Thomas Munro
http://www.enterprisedb.com

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#44)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Mon, Jan 22, 2018 at 11:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

It looks to me like longfin's average and
variation increased half way between fa330f9a and 18042840, somewhere
near Dec 7 to 9, when we went from ~40s +/- 1 to ~50s with several
seconds' variation. Was there some other environmental change then on
that machine?

Ahh. That corresponds to the release of this patch:
https://support.apple.com/en-nz/HT208331

Yeah, I'm pretty sure that I installed that update immediately when
it came out.

There were some other strange things going on on that machine too;
I remember having noticed and killed a daemon process that seemed to be
stuck at 100% CPU. That might explain the sudden dropoff in runtimes in
early January. And seeing that the times seem to be back up, I looked
just now, and there it is again:

PID    COMMAND     %CPU  TIME      #TH  #WQ  #PORT  MEM    PURG  CMPRS  PGRP
360    fontworker  75.9  238 hrs   4    4    45-    28M    0B    0B     360
359    fontd       25.3  86:18:50  4/1  3/1  74     33M+   0B    1396K  359
76     logd        12.7  26:25:16  5    5    611    31M-   0B    0B     76
46241  top          2.6  00:01.04  1/1  0    23     2140K  0B    0B     46241

Weird --- my other macOS machines aren't doing that.

Anyway, it looks like we should write off these longfin timings as
ambient noise :-(

regards, tom lane

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#45)
1 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

I wrote:

Anyway, it looks like we should write off these longfin timings as
ambient noise :-(

Here's a possibly more useful graph of regression test timings over
the last year. I pulled this from the buildfarm database: it is the
reported runtime for the "installcheck-C" step in each successful
build of HEAD on dromedary, going back to Jan. 2017. I picked dromedary
because I know that that machine hasn't gotten any software updates
nor is there anything else very interesting going on on it. I dropped
three or four obvious outlier reports (possibly those ran during the
machine's nightly backup cron job). The reported runtime is only
precise to 1s, and a couple seconds jitter is hardly surprising, so
there's a good deal of noise. Still, it's possible to discern when
I put some effort into test runtime reduction back in April, and
it can be seen that things have gotten notably slower since the
beginning of November.

regards, tom lane

Attachments:

dromedary-times.png (image/png)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#46)
Re: pgsql: Add parallel-aware hash joins.

On Mon, Jan 22, 2018 at 6:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Here's a possibly more useful graph of regression test timings over
the last year. I pulled this from the buildfarm database: it is the
reported runtime for the "installcheck-C" step in each successful
build of HEAD on dromedary, going back to Jan. 2017. I picked dromedary
because I know that that machine hasn't gotten any software updates
nor is there anything else very interesting going on on it. I dropped
three or four obvious outlier reports (possibly those ran during the
machine's nightly backup cron job). The reported runtime is only
precise to 1s, and a couple seconds jitter is hardly surprising, so
there's a good deal of noise. Still, it's possible to discern when
I put some effort into test runtime reduction back in April, and
it can be seen that things have gotten notably slower since the
beginning of November.

Right, but this doesn't seem to show any big spike in the runtime at
the time when parallel hash was committed, or when the preparatory
patch to add test coverage for hash joins got committed. Rather,
there's a gradual increase over time. Either we're making the server
slower (which would be bad) or we're adding proper test coverage for
all the new features that we're adding (which would be good). We
can't expect every feature patch to preserve the runtime of the tests
absolutely unchanged; figuring out what can be optimized is a separate
exercise from adding test coverage either for new things or for things
that weren't previously covered.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#47)
3 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jan 22, 2018 at 6:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Here's a possibly more useful graph of regression test timings over
the last year. I pulled this from the buildfarm database: it is the
reported runtime for the "installcheck-C" step in each successful
build of HEAD on dromedary, going back to Jan. 2017.

Right, but this doesn't seem to show any big spike in the runtime at
the time when parallel hash was committed, or when the preparatory
patch to add test coverage for hash joins got committed. Rather,
there's a gradual increase over time.

Well, there's just too much noise in this chart. Let's try another
machine: prairiedog, which is a lot slower so that the 1s resolution
isn't such a limiting factor, and it's also one that I know there hasn't
been much of any system change in.

The first attached PNG shows the "installcheck-C" runtime for as far
back as the buildfarm database has the data, and the second zooms in
on events since late 2016. As before, I've dropped individual outlier
results (those significantly slower than any nearby run) on the grounds
that they probably represent interference from nightly backups. I also
attached the raw data (including outliers) in case anyone wants to do
their own analysis.

There is a very clear secular trend up in the longer data series,
which indicates that we're testing more stuff, which doesn't bother
me in itself as long as the time is well spent. However, the trend
over the last two months is very bad, and I do not think that we can
point to any large improvement in test coverage that someone committed
since November.

Looking more closely at the shorter series, there are four pretty obvious
step changes since 2016-09. The PNG's x-axis doesn't have enough
resolution to match these up to commits, but looking at the underlying
data, they clearly correspond to:

Branch: master Release: REL_10_BR [b801e1200] 2016-10-18 15:57:58 -0400
Improve regression test coverage for hash indexes.

Branch: master Release: REL_10_BR [4a8bc39b0] 2017-04-12 16:17:53 -0400
Speed up hash_index regression test.

Branch: master [fa330f9ad] 2017-11-29 16:06:50 -0800
Add some regression tests that exercise hash join code.

Branch: master [180428404] 2017-12-21 00:43:41 -0800
Add parallel-aware hash joins.

I thought that the hash index test case was excessively expensive for
what it covered, and I'm now thinking the same about hash joins.

regards, tom lane

Attachments:

pdog-all.png (image/png)
pdog-recent.png (image/png)
pdog-raw.txt (text/plain; charset=us-ascii)
#49Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#48)
1 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

On Wed, Jan 24, 2018 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There is a very clear secular trend up in the longer data series,
which indicates that we're testing more stuff,

+1

which doesn't bother
me in itself as long as the time is well spent. However, the trend
over the last two months is very bad, and I do not think that we can
point to any large improvement in test coverage that someone committed
since November.

I'm not sure if coverage.postgresql.org has a way to view historical
reports so we can see the actual percentage change, but as I recall
it, commit fa330f9ad "Add some regression tests that exercise hash
join code." pushed nodeHash.c and possibly nodeHashJoin.c into green
territory on here:

https://coverage.postgresql.org/src/backend/executor/index.html

Looking more closely at the shorter series, there are four pretty obvious
step changes since 2016-09. The PNG's x-axis doesn't have enough
resolution to match these up to commits, but looking at the underlying
data, they clearly correspond to:

Branch: master Release: REL_10_BR [b801e1200] 2016-10-18 15:57:58 -0400
Improve regression test coverage for hash indexes.

Branch: master Release: REL_10_BR [4a8bc39b0] 2017-04-12 16:17:53 -0400
Speed up hash_index regression test.

Branch: master [fa330f9ad] 2017-11-29 16:06:50 -0800
Add some regression tests that exercise hash join code.

Joining check runtimes with the commit log (see attached), I see:

2017-11-30 | fa330f9a | Add some regression tests that exercise | 00:08:30
2017-11-29 | 84940644 | New C function: bms_add_range | 00:08:18

That's +2.4%.

Branch: master [180428404] 2017-12-21 00:43:41 -0800
Add parallel-aware hash joins.

2017-12-21 | cce1ecfc | Adjust assertion in GetCurrentCommandId. | 00:09:03
2017-12-21 | 6719b238 | Rearrange execution of PARAM_EXTERN Para |
2017-12-21 | 8a0596cb | Get rid of copy_partition_key |
2017-12-21 | 9ef6aba1 | Fix typo |
2017-12-21 | c98c35cd | Avoid putting build-location-dependent s |
2017-12-21 | 59d1e2b9 | Cancel CV sleep during subtransaction ab |
2017-12-21 | 18042840 | Add parallel-aware hash joins. |
2017-12-20 | f94eec49 | When passing query strings to workers, p | 00:08:45

That's +3.4%. That's a bit more than I expected. I saw 2.5% on my
development box and hoped that'd be OK for a complex feature with a
lot of paths to test.
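For reference, the arithmetic behind those two figures can be checked directly from the table above (Python used purely for illustration):

```python
# Verify the quoted runtime deltas: fa330f9a took "check" from 8:18 to
# 8:30, and 18042840 took it from 8:45 to 9:03.  Times are given as
# (minutes, seconds) pairs.
def pct_increase(before, after):
    to_seconds = lambda t: t[0] * 60 + t[1]   # (min, sec) -> total seconds
    return 100.0 * (to_seconds(after) - to_seconds(before)) / to_seconds(before)

print(round(pct_increase((8, 18), (8, 30)), 1))  # fa330f9a -> 2.4
print(round(pct_increase((8, 45), (9, 3)), 1))   # 18042840 -> 3.4
```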

But hang on a minute -- how did we get to 08:45 from 08:30 between
those commits? Of course this is all noisy data and individual
samples are all over the place, but I think I see some signal here:

2017-12-20 | f94eec49 | When passing query strings to workers, p | 00:08:45
2017-12-19 | 7d3583ad | Test instrumentation of Hash nodes with | 00:08:43
2017-12-19 | 8526bcb2 | Try again to fix accumulation of paralle |
2017-12-19 | 38fc5470 | Re-fix wrong costing of Sort under Gathe | 00:08:31
2017-12-19 | 09a65f5a | Mark a few parallelism-related variables | 00:08:27

Both 8526bcb2 and 7d3583ad add Gather rescans. Doesn't 8526bcb2 add a
query that forks 40 times? I wonder how long it takes to start a
background worker on a Mac Cube. From that commit:

+         ->  Gather (actual rows=9800 loops=10)
+               Workers Planned: 4
+               Workers Launched: 4
+               ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)

I thought that the hash index test case was excessively expensive for
what it covered, and I'm now thinking the same about hash joins.

Does join-test-shrink.patch (from my earlier message) help much on
prairiedog? It cut "check" time by ~1.5% on my low-end machine.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

prairiedog-check-speed-vs-commit-log.txt (text/plain; charset=US-ASCII)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#48)
Re: pgsql: Add parallel-aware hash joins.

On Tue, Jan 23, 2018 at 6:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Looking more closely at the shorter series, there are four pretty obvious
step changes since 2016-09. The PNG's x-axis doesn't have enough
resolution to match these up to commits, but looking at the underlying
data, they clearly correspond to:

Branch: master Release: REL_10_BR [b801e1200] 2016-10-18 15:57:58 -0400
Improve regression test coverage for hash indexes.

Branch: master Release: REL_10_BR [4a8bc39b0] 2017-04-12 16:17:53 -0400
Speed up hash_index regression test.

Branch: master [fa330f9ad] 2017-11-29 16:06:50 -0800
Add some regression tests that exercise hash join code.

Branch: master [180428404] 2017-12-21 00:43:41 -0800
Add parallel-aware hash joins.

I thought that the hash index test case was excessively expensive for
what it covered, and I'm now thinking the same about hash joins.

Hmm. I guess I'm insulated from some of the problem here by my choice
of hardware. On my laptop, 'make check' takes between 25.5 and 26
seconds (on 28e04155f17cabda7a18aee31d130aa10e25ee86). If I remove
the hash_index test from parallel_schedule, it still takes between
25.5 and 26 seconds. If I also remove the join test in its entirety,
it drops down to 24-24.5 seconds. If I put hash_index and join back
in the schedule file but revert join.sql and join.out to the version
just before fa330f9ad, it takes about 24.5 seconds. So for me, the
additional hash index tests don't cost anything measurable and the
additional hash join tests cost about a second. I think this probably
accounts for why committers other than you keep "adding so much time
to the regression tests". On modern hardware, the costs just don't
matter. As a further point of reference, on this machine, 9.5 stable
is 24.5-25 seconds, and 9.3 is 25.5-26 seconds, so from here it looks
like in the last 5 years the speed of 'make check' is within a half
second or so of the performance we had 5 years ago even though the
volume of the regression tests in terms of lines of SQL code has
increased by more than 50% in the same time period.

Now, how much should we care about the performance of software with a
planned release date of 2018 on hardware discontinued in 2001,
hardware that is apparently about 20 times slower than a modern
laptop? Some, perhaps, but maybe not a whole lot. Removing tests
that have found actual bugs because they cost runtime on ancient
systems that nobody uses for serious work doesn't make sense to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#51Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#47)
Re: pgsql: Add parallel-aware hash joins.

On 2018-01-23 14:24:56 -0500, Robert Haas wrote:

Right, but this doesn't seem to show any big spike in the runtime at
the time when parallel hash was committed, or when the preparatory
patch to add test coverage for hash joins got committed. Rather,
there's a gradual increase over time. Either we're making the server
slower (which would be bad) or we're adding proper test coverage for
all the new features that we're adding (which would be good). We
can't expect every feature patch to preserve the runtime of the tests
absolutely unchanged; figuring out what can be optimized is a separate
exercise from adding test coverage either for new things or for things
that weren't previously covered.

Agreed.

On the improvement front, my observation is that we are rarely actually
cpu bound across processes. One thing I've been wondering is whether we
can get a pretty large win from just rescheduling
parallel_schedule. There definitely are individual test files that take a
lot longer than others, but their positioning in groups doesn't
necessarily reflect that.

Besides manually reordering the schedule, I think it might be time that
we improve pg_regress's scheduling. One big first step would e.g. be to
not manually limit the number of parallel tests in a group to 20, but
instead allow larger groups and only run a limited number of them in
parallel. If done right we could start the next test in a group as soon
as *one* task in a group has finished, rather than waiting for all of
them to finish as we currently do for (sub-)groups.

Besides larger groups, starting the next test(s) earlier, another way to
gain pretty large improvements would be a test schedule feature that
allowed to state dependencies between tests. So instead of manually
grouping the schedule, have 'numerology' state that it depends on int2,
int4, int8, float4, float8, which means it can actually be started
earlier than it currently can in many cases.

Greetings,

Andres Freund

#52Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#50)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2018-01-24 13:11:22 -0500, Robert Haas wrote:

So for me, the additional hash index tests don't cost anything
measurable and the additional hash join tests cost about a second. I
think this probably accounts for why committers other than you keep
"adding so much time to the regression tests". On modern hardware,
the costs just don't matter.

I very much agree with the general sentiment, but a second of a 25s test
certainly isn't nothing. As I've just written a few messages upthread,
I think we can hide the overall timing costs to a much larger degree
than we're doing, but that doesn't mean we need pay no attention at all.

Now, how much should we care about the performance of software with a
planned release date of 2018 on hardware discontinued in 2001,
hardware that is apparently about 20 times slower than a modern
laptop? Some, perhaps, but maybe not a whole lot. Removing tests
that have found actual bugs because they cost runtime on ancient
systems that nobody uses for serious work doesn't make sense to me.

I again agree with the sentiment. One caveat is that old machines also
somewhat approximate testing with more instrumentation / debugging
enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
test overhead has still quite some benefits. But I definitely do not
want to lower coverage to achieve it.

Greetings,

Andres Freund

#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#52)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund <andres@anarazel.de> writes:

On 2018-01-24 13:11:22 -0500, Robert Haas wrote:

Now, how much should we care about the performance of software with a
planned release date of 2018 on hardware discontinued in 2001,
hardware that is apparently about 20 times slower than a modern
laptop? Some, perhaps, but maybe not a whole lot. Removing tests
that have found actual bugs because they cost runtime on ancient
systems that nobody uses for serious work doesn't make sense to me.

I again agree with the sentiment.

I find that to be a completely bogus straw-man argument. The point of
looking at the prairiedog time series is just to see a data series in
which the noise level is small enough to discern the signal. If anyone's
got years worth of data off a more modern machine, and they can extract
a signal from that, by all means let's consider that data instead. But
there's no clear argument (or at least you have not made one) that says
that prairiedog's relative timings don't match what we'd get on more
modern machines.

Now, what *is* a relevant objection is that most of us care more about
the runtime of the parallelized regression tests than serial tests.
I did not use prairiedog's "make check" timings in these graphs because
that would include "make install" and initdb timings, adding noise and
overhead and probably making it harder to see what's going on. But
it's perfectly fair to want to optimize that case not the serial case.

However ... if you spend any time looking at the behavior of that,
the hashjoin tests are still problematic. I instrumented the parallel
tests by turning on log_disconnections so as to get per-test-script
timings, and what I find to be the slowest steps on my development
workstation are

[pg_regress/rowsecurity] 0.517
[pg_regress/partition_join] 0.535
[pg_regress/updatable_views] 0.546
[pg_regress/stats] 0.566
[pg_regress/triggers] 0.618
[pg_regress/foreign_data] 0.642
[pg_regress/stats_ext] 0.670
[pg_regress/select_parallel] 0.828
[pg_regress/create_index] 0.916
[pg_regress/alter_table] 1.187
[pg_regress/gist] 1.283
[pg_regress/join] 1.923
[pg_regress/plpgsql] 3.100

(The overall runtime for "make installcheck-parallel" on this machine
is about 17.3 seconds right now.) The next slowest test script in
the join test's group is "update", at 0.373 seconds; so over 1.5 sec
of the total 17.3 sec runtime is being spent solely in the join script.
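
The per-script numbers above come from reading the disconnection entries
back out of the server log. As a rough sketch of that kind of
instrumentation (assuming log_line_prefix = '[%a] ' so the
application_name that pg_regress sets, "pg_regress/<testname>", prefixes
each line; the helper below is illustrative, not part of pg_regress):

```python
import re
from collections import defaultdict

# Matches lines like
#   [pg_regress/join] LOG:  disconnection: session time: 0:00:01.923 user=...
# as written when log_disconnections = on and log_line_prefix = '[%a] '.
LINE = re.compile(r'\[(pg_regress/\w+)\].*'
                  r'disconnection: session time: (\d+):(\d+):([\d.]+)')

def per_test_timings(log_lines):
    """Sum session time per test script; slowest scripts sort last."""
    totals = defaultdict(float)
    for line in log_lines:
        m = LINE.search(line)
        if m:
            name, h, mi, s = m.groups()
            totals[name] += int(h) * 3600 + int(mi) * 60 + float(s)
    return sorted(totals.items(), key=lambda kv: kv[1])

sample = [
    '[pg_regress/stats] LOG:  disconnection: session time: 0:00:00.566 user=x',
    '[pg_regress/join] LOG:  disconnection: session time: 0:00:01.923 user=x',
]
```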

Running the same test on the v10 branch, the slowest steps are

[pg_regress/join] 0.521
[pg_regress/rowsecurity] 0.521
[pg_regress/updatable_views] 0.554
[pg_regress/triggers] 0.624
[pg_regress/foreign_data] 0.647
[pg_regress/stats_ext] 0.675
[pg_regress/select_parallel] 0.690
[pg_regress/create_index] 0.928
[pg_regress/gist] 1.020
[pg_regress/alter_table] 1.120
[pg_regress/plpgsql] 3.217

so join has gotten about 1 second slower since v10, and that time is
coming entirely out of developers' hides despite parallelism because
it was already the slowest in its group.

So I continue to maintain that an unreasonable fraction of the total
resources devoted to the regular regression tests is going into these
new hashjoin tests.

Based on these numbers, it seems like one easy thing we could do to
reduce parallel check time is to split the plpgsql test into several
scripts that could run in parallel. But independently of that,
I think we need to make an effort to push hashjoin's time back down.

One caveat is that old machines also
somewhat approximate testing with more instrumentation / debugging
enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
test overhead has still quite some benefits. But I definitely do not
want to lower coverage to achieve it.

I don't want to lower coverage either. I do want some effort to be
spent on achieving test coverage intelligently, rather than just throwing
large test cases at the code without consideration of the costs.

regards, tom lane

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#53)
2 attachment(s)
Re: pgsql: Add parallel-aware hash joins.

I wrote:

I find that to be a completely bogus straw-man argument. The point of
looking at the prairiedog time series is just to see a data series in
which the noise level is small enough to discern the signal. If anyone's
got years worth of data off a more modern machine, and they can extract
a signal from that, by all means let's consider that data instead.

Just to make the point, I scraped the numbers for skink's "installcheck"
and "check" steps, which are data series I imagine at least Andres will
concede are worth paying attention to. I made no attempt to clean
outliers, so these curves are pretty noisy, but I think there is a very
clear upward bump since mid-December.

regards, tom lane

Attachments:

skink-installcheck.png (image/png) [binary image data omitted]
skink-check.png (image/png) [binary image data omitted]
#55Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#53)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2018-01-24 14:31:47 -0500, Tom Lane wrote:

However ... if you spend any time looking at the behavior of that,
the hashjoin tests are still problematic.

I think my main problem with your arguments is that you basically seem
to say that one of the more complex features in postgres can't increase
the test time. And I just don't agree with that. If we can reduce some
unnecessary overhead (as Thomas iirc has done somewhere nearby) - great,
if we can hide the overhead by scheduling the test better or breaking it
up - also great. But if that's a good chunk of work I think it's
entirely reasonable to not necessarily consider that the best use of
time.

It doesn't seem too surprising that a test that relies on starting
multiple background processes in multiple places will be among the more
expensive ones. We clearly would e.g. benefit from being able to reuse
workers, to avoid constantly starting/stopping them.

(The overall runtime for "make installcheck-parallel" on this machine
is about 17.3 seconds right now.) The next slowest test script in
the join test's group is "update", at 0.373 seconds; so over 1.5 sec
of the total 17.3 sec runtime is being spent solely in the join script.

Might be worth breaking up join a bit, that won't get rid of all the
wall time overhead, but should reduce it. Reordering to run parallel to
other slow tests might also be worthwhile.

So I continue to maintain that an unreasonable fraction of the total
resources devoted to the regular regression tests is going into these
new hashjoin tests.

One caveat is that old machines also
somewhat approximate testing with more instrumentation / debugging
enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
test overhead has still quite some benefits. But I definitely do not
want to lower coverage to achieve it.

I don't want to lower coverage either. I do want some effort to be
spent on achieving test coverage intelligently, rather than just throwing
large test cases at the code without consideration of the costs.

I think this accusation is unfair. Are you really suggesting that nobody
else cares about the runtime of the new tests? Just because other
people's tradeoffs come down at a somewhat different place, doesn't mean
they add tests "without consideration of the costs".

Based on these numbers, it seems like one easy thing we could do to
reduce parallel check time is to split the plpgsql test into several
scripts that could run in parallel. But independently of that,
I think we need to make an effort to push hashjoin's time back down.

If we had a dependency based system as I suggested nearby, we could have
pg_regress order the tests so that the slowest ones that have
dependencies fulfilled are started first...

Greetings,

Andres Freund

#56Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Andres Freund (#51)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund wrote:

Besides larger groups, starting the next test(s) earlier, another way to
gain pretty large improvements would be a test schedule feature that
allowed to state dependencies between tests. So instead of manually
grouping the schedule, have 'numerology' state that it depends on int2,
int4, int8, float4, float8, which means it can actually be started
earlier than it currently can in many cases.

Yeah, I proposed this a decade ago but never had the wits to write the
code.

It would be very useful for running tests standalone, too -- much as I
dislike 'make installcheck' taking half a minute, I dislike much more
having to take 5 minutes each time to figure out that create_table
depends on box, polygon, create_function, yadda yadda.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#57Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Thomas Munro (#49)
Re: pgsql: Add parallel-aware hash joins.

Thomas Munro wrote:

On Wed, Jan 24, 2018 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, the trend over the last two months is very bad, and I do
not think that we can point to any large improvement in test
coverage that someone committed since November.

I'm not sure if coverage.postgresql.org has a way to view historical
reports so we can see the actual percentage change,

It does not. I have had that in my to-do list for a while, but haven't
gotten around to it.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#56)
Re: pgsql: Add parallel-aware hash joins.

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Andres Freund wrote:

Besides larger groups, starting the next test(s) earlier, another way to
gain pretty large improvements would be a test schedule feature that
allowed to state dependencies between tests. So instead of manually
grouping the schedule, have 'numerology' state that it depends on int2,
int4, int8, float4, float8, which means it can actually be started
earlier than it currently can in many cases.

Yeah, I proposed this a decade ago but never had the wits to write the
code.

There'd be a lot of followup work to sanitize the tests better. For
instance, if two tests transiently create tables named "foo", it doesn't
matter as long as they're not in the same group. It would matter with
this.

There are things we could do to mitigate that --- one attractive idea
is to have each test create its own schema for transient objects.
The limiting factor is that we don't want the test scripts to change
so much that back-patching tests becomes impossible. (Or at least,
I'd not like that.)

I'm concerned that we'd end up with a higher number of irreproducible
test failures with no good way to investigate them.

It would be very useful for running tests standalone, too -- much as I
dislike 'make installcheck' taking half a minute, I dislike much more
having to take 5 minutes each time to figure out that create_table
depends on box, polygon, create_function, yadda yadda.

Yeah, if we could say "run just this test and its needed precursors",
that'd be a huge win in a lot of situations.

Definitely seems like an idea worth pursuing.

regards, tom lane

#59Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#56)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2018-01-24 17:18:26 -0300, Alvaro Herrera wrote:

Andres Freund wrote:

Besides larger groups, starting the next test(s) earlier, another way to
gain pretty large improvements would be a test schedule feature that
allowed to state dependencies between tests. So instead of manually
grouping the schedule, have 'numerology' state that it depends on int2,
int4, int8, float4, float8, which means it can actually be started
earlier than it currently can in many cases.

Yeah, I proposed this a decade ago but never had the wits to write the
code.

It shouldn't be too hard, right? Leaving defining the file format,
parsing it, creating the new schedule with dependencies and adapting
tests aside (hah), it mostly seems a relatively simple graph ordering /
topological sort problem, right? If we keep the timings from an earlier
run somewhere, we can use the timing of runs as edge weights, making the
schedule better.
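
The idea above - a dependency graph with per-test cost as a weight,
launching the heaviest ready test first - can be sketched in a few
lines. Everything here (function name, inputs) is hypothetical; nothing
like it exists in pg_regress:

```python
import heapq

def schedule(tests, deps, cost, workers=2):
    """Greedy list scheduling: whenever a worker is free, launch the
    costliest test whose dependencies are all satisfied.
    Returns (makespan, launch order)."""
    remaining = {t: set(deps.get(t, ())) for t in tests}
    ready = [(-cost[t], t) for t in tests if not remaining[t]]
    heapq.heapify(ready)                 # max-cost first via negation
    running = []                         # min-heap of (finish_time, test)
    now, done, order = 0.0, 0, []
    while done < len(tests):
        # fill free workers with the heaviest ready tests
        while ready and len(running) < workers:
            neg_cost, t = heapq.heappop(ready)
            order.append(t)
            heapq.heappush(running, (now - neg_cost, t))
        # advance the clock to the next completion
        now, finished = heapq.heappop(running)
        done += 1
        for t, prereqs in remaining.items():
            if finished in prereqs:
                prereqs.discard(finished)
                if not prereqs:          # all deps met: becomes ready
                    heapq.heappush(ready, (-cost[t], t))
    return now, order
```

With e.g. numerology depending on int2 and int4, the scheduler runs the
independent long test alongside the prerequisites instead of leaving it
stranded at the end of a group.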

It would be very useful for running tests standalone, too -- much as I
dislike 'make installcheck' taking half a minute, I dislike much more
having to take 5 minutes each time to figure out that create_table
depends on box, polygon, create_function, yadda yadda.

Oh, that's a good point.

- Andres

#60Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#58)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2018-01-24 15:36:35 -0500, Tom Lane wrote:

There'd be a lot of followup work to sanitize the tests better. For
instance, if two tests transiently create tables named "foo", it doesn't
matter as long as they're not in the same group. It would matter with
this.

Right. I suspect we'd initially end up with a schedule whose
dependencies are pretty similar to what we have now as groups. I suspect
that even with a very small number of changes we'd get a lot better
timings.

There are things we could do to mitigate that --- one attractive idea
is to have each test create its own schema for transient objects.
The limiting factor is that we don't want the test scripts to change
so much that back-patching tests becomes impossible. (Or at least,
I'd not like that.)

I think a lot of temporary, potentially conflicting, objects are already
created as TEMPORARY. Adding a few more temporary markers shouldn't be
too hard.

I'm concerned that we'd end up with a higher number of irreproducible
test failures with no good way to investigate them.

Hm. We probably should dump the used ordering of tests somewhere upon
failure, to make it easier to debug.

Greetings,

Andres Freund

#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#59)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund <andres@anarazel.de> writes:

On 2018-01-24 17:18:26 -0300, Alvaro Herrera wrote:

Yeah, I proposed this a decade ago but never had the wits to write the
code.

It shouldn't be too hard, right? Leaving defining the file format,
parsing it, creating the new schedule with depencencies and adapting
tests aside (hah), it mostly seems a relatively simple graph ordering /
topological sort problem, right?

Yeah. We already have topo sort code in pg_dump, maybe we could push that
into someplace like src/common or src/fe_utils? Although pg_dump hasn't
got any need for edge weights, so maybe sharing code isn't worth it.

We could flush the existing schedule files and use a simple format like
testname: list of earlier tests it depends on
(I guess there would be more properties than just the dependencies,
but still not hard to parse.)
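
Parsing such a format really is trivial; a sketch, with the format
details (colon separator, '#' comments) being assumptions on top of the
one-line proposal above:

```python
def parse_schedule(text):
    """Parse lines of a hypothetical 'testname: dep1 dep2 ...' schedule
    format into a dict mapping each test to its prerequisite list."""
    deps = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # strip comments/blanks
        if not line:
            continue
        name, _, rest = line.partition(':')
        deps[name.strip()] = rest.split()
    return deps
```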

If we keep the timings from an earlier
run somewhere, we can use the timing of runs as edge weights, making the
schedule better.

I think we could just use constant values hand-coded in the schedule file.
It might occasionally be worth updating them, but realistically it's not
going to matter that they be very accurate. Probably weights like 1, 2,
and 3 would be plenty ;-)

regards, tom lane

#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#60)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund <andres@anarazel.de> writes:

On 2018-01-24 15:36:35 -0500, Tom Lane wrote:

I'm concerned that we'd end up with a higher number of irreproducible
test failures with no good way to investigate them.

Hm. We probably should dump the used ordering of tests somwhere upon
failure, to make it easier to debug.

The progress-display output of pg_regress would need a complete rethink
anyhow. First thought is to emit two lines per test, one when we
launch it and one when it finishes and we check the results:

foreign_data: launched
...
foreign_data: ok (or FAILED)

regards, tom lane

#63Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#61)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2018-01-24 15:58:16 -0500, Tom Lane wrote:

Yeah. We already have topo sort code in pg_dump, maybe we could push that
into someplace like src/common or src/fe_utils? Although pg_dump hasn't
got any need for edge weights, so maybe sharing code isn't worth it.

I suspect it may be more work to share than worth it, but either way, it
shouldn't be too hard. Hm, isn't dbObjectTypePriority kinda an edge
weight? Seems like we properly could implement it as that.

We could flush the existing schedule files and use a simple format like
testname: list of earlier tests it depends on
(I guess there would be more properties than just the dependencies,
but still not hard to parse.)

Yea, I think there'd need to be a few more. There are some tests that use
multiple connections, and I suspect it'll be useful to have "implicit"
ordering dependencies for a few tests, like a "barrier". Otherwise
e.g. the tablespace test will be annoying to order.

If we keep the timings from an earlier
run somewhere, we can use the timing of runs as edge weights, making the
schedule better.

I think we could just use constant values hand-coded in the schedule file.
It might occasionally be worth updating them, but realistically it's not
going to matter that they be very accurate. Probably weights like 1, 2,
and 3 would be plenty ;-)

The reason I like the idea of using prior tests as scheduling input is
that the slowness actually depends a lot on the type of machine it's run
on, and more importantly on things like valgrind, CCA, fsync=on/off,
jit=on/off (by far the most expensive test there is e.g. the recursion
test in errors.sql :)).

Greetings,

Andres Freund

#64Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#53)
Re: pgsql: Add parallel-aware hash joins.

On Wed, Jan 24, 2018 at 2:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I find that to be a completely bogus straw-man argument. The point of
looking at the prairiedog time series is just to see a data series in
which the noise level is small enough to discern the signal. If anyone's
got years worth of data off a more modern machine, and they can extract
a signal from that, by all means let's consider that data instead. But
there's no clear argument (or at least you have not made one) that says
that prairiedog's relative timings don't match what we'd get on more
modern machines.

There is no need to collect years of data in order to tell whether or
not the time to run the tests has increased by as much on developer
machines as it has on prairiedog. You showed the time going from 3:36
to 8:09 between 2014 and the present. That is a 2.26x increase. It
is obvious from the numbers I posted before that no such increase has
taken place in the time it takes to run 'make check' on my relatively
modern laptop. Whatever difference exists is measured in
milliseconds.

so join has gotten about 1 second slower since v10, and that time is
coming entirely out of developers' hides despite parallelism because
it was already the slowest in its group.

So I continue to maintain that an unreasonable fraction of the total
resources devoted to the regular regression tests is going into these
new hashjoin tests.

I think there is an affirmative desire on the part of many
contributors to have newer features tested more thoroughly than old
ones were. That will tend to mean that features added more recently
have test suites that are longer-running compared to the value of the
feature they test than what we had in the past. When this has been
discussed at developer meetings, everyone except you (and to a lesser
extent me) has been in favor of this. Even if that meant that you had
to wait 1 extra second every time you run 'make check', I would judge
that worthwhile. But it probably doesn't, because there are a lot of
things that can be done to improve this situation, such as...

Based on these numbers, it seems like one easy thing we could do to
reduce parallel check time is to split the plpgsql test into several
scripts that could run in parallel. But independently of that,
I think we need to make an effort to push hashjoin's time back down.

...this. Also, the same technique could probably be applied to the
join test itself. I think Thomas just added the tests to that file
because it already existed, but there's nothing to say that the file
couldn't be split into several chunks. On a quick look, it looks to
me as though that file is testing a lot of pretty different things,
and it's one of the largest test case files, accounting for ~3% of the
total test suite by itself.

Another thing you could do is consider applying the patch Thomas
already posted to reduce the size of the tables involved. The problem
is that, for you and the buildfarm to be happy, the tests have to (1)
run near-instantaneously even on thoroughly obsolete hardware, (2)
give exactly the same answers on 32-bit systems, 64-bit systems,
Linux, Windows, AIX, HP-UX, etc., and (3) give those same exact
answers 100% deterministically on all of those platforms. Parallel
query is inherently non-deterministic about things like how much work
goes to each worker, and I think that really small tests will tend to
show more edge cases like one worker not doing anything. So it might
be that if we cut down the sizes of the test cases we'll spend more
time troubleshooting the resulting instability than any developer time
we would've saved by reducing the runtime. But we can try it.

One caveat is that old machines also
somewhat approximate testing with more instrumentation / debugging
enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc.), so removing
excessive test overhead still has quite some benefit. But I definitely
do not want to lower coverage to achieve it.

I don't want to lower coverage either. I do want some effort to be
spent on achieving test coverage intelligently, rather than just throwing
large test cases at the code without consideration of the costs.

I don't believe that any such thing is occurring, and I think it's
wrong of you to imply that these test cases were added
unintelligently. To me, that seems like an ad hominem attack on both
Thomas (who spent a year or more developing the feature those test
cases exercise) and Andres (who committed them).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#65Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#62)
Re: pgsql: Add parallel-aware hash joins.

On Wed, Jan 24, 2018 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The progress-display output of pg_regress would need a complete rethink
anyhow. First thought is to emit two lines per test, one when we
launch it and one when it finishes and we check the results:

foreign_data: launched
...
foreign_data: ok (or FAILED)

I suspect that first line would only be interesting on the very rare
occasions when test hangs. I'm not sure it would be good to add that
much chatter.
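The scheme Tom sketches is easy to model; a toy emitter (purely illustrative, not pg_regress code) makes the chatter concern concrete, since every test now produces two output lines instead of one:

```python
# Toy model of the two-line-per-test progress display proposed above:
# one line at launch, one when the result is checked.
def progress(events):
    """events: (test_name, status) pairs; status is 'launched', 'ok' or 'FAILED'."""
    return [f"{name}: {status}" for name, status in events]

lines = progress([("foreign_data", "launched"), ("foreign_data", "ok")])
# Twice as many lines as the current one-line-per-test display.
```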

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#66Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#43)
Re: pgsql: Add parallel-aware hash joins.

Hi,

On 2018-01-22 23:17:47 +1300, Thomas Munro wrote:

On Thu, Dec 28, 2017 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Here is a patch that halves the size of the test tables used. I don't
want them to be too small because I want some real parallel
processing at least sometimes. On my slowest system with assertions
enabled this brings "time make check" from ~37.5s to ~36.9s. I'm a
bit worried about choosing parameters that lead to instability across
the buildfarm, until I make the size estimation code a bit smarter[1],
so I tested a few variations that affect the alignment and size of
things (-m32, -malign-double, --disable-atomics, --disable-spinlocks)
and didn't see failures. Does this pass repeatedly on gaur?

I'd say, let's just commit it and see?

Will do so soon if nobody protests.

Greetings,

Andres Freund

#67Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#46)
1 attachment(s)
reducing isolation tests runtime

On the subject of test total time, we could parallelize isolation tests.
Right now "make check" in src/test/isolation takes 1:16 on my machine.
Test "timeouts" takes full 40s of that, with nothing running in parallel
-- the machine is completely idle.

Seems like we can have a lot of time back just by changing the schedule
to use multiple tests per line (in particular, put the other slow tests
together with timeouts), per the attached; with this new schedule,
isolation takes 44 seconds on my machine -- a win of 32 seconds. We can
win a couple of additional seconds by grouping a few other lines, but
this is the biggest win.

(This needs to be adjusted because some table names in the specs
conflict.)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

isosched.patchtext/plain; charset=us-asciiDownload
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..f8dbd018e7 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -1,3 +1,4 @@
+test: timeouts tuplelock-update deadlock-hard deadlock-soft-2 alter-table-1 receipt-report multiple-row-versions
 test: read-only-anomaly
 test: read-only-anomaly-2
 test: read-only-anomaly-3
@@ -6,26 +7,18 @@ test: read-write-unique-2
 test: read-write-unique-3
 test: read-write-unique-4
 test: simple-write-skew
-test: receipt-report
 test: temporal-range-integrity
 test: project-manager
 test: classroom-scheduling
 test: total-cash
 test: referential-integrity
 test: ri-trigger
-test: partial-index
-test: two-ids
-test: multiple-row-versions
-test: index-only-scan
+test: partial-index two-ids index-only-scan eval-plan-qual lock-update-delete lock-committed-update lock-committed-keyupdate tuplelock-conflict alter-table-2 alter-table-3 create-trigger
 test: deadlock-simple
-test: deadlock-hard
 test: deadlock-soft
-test: deadlock-soft-2
 test: fk-contention
 test: fk-deadlock
 test: fk-deadlock2
-test: eval-plan-qual
-test: lock-update-delete
 test: lock-update-traversal
 test: insert-conflict-do-nothing
 test: insert-conflict-do-nothing-2
@@ -38,12 +31,8 @@ test: delete-abort-savept-2
 test: aborted-keyrevoke
 test: multixact-no-deadlock
 test: multixact-no-forget
-test: lock-committed-update
-test: lock-committed-keyupdate
 test: update-locked-tuple
 test: propagate-lock-delete
-test: tuplelock-conflict
-test: tuplelock-update
 test: freeze-the-dead
 test: nowait
 test: nowait-2
@@ -56,13 +45,8 @@ test: skip-locked-3
 test: skip-locked-4
 test: drop-index-concurrently-1
 test: multiple-cic
-test: alter-table-1
-test: alter-table-2
-test: alter-table-3
 test: alter-table-4
-test: create-trigger
 test: sequence-ddl
 test: async-notify
 test: vacuum-reltuples
-test: timeouts
 test: vacuum-concurrent-drop
#68Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#67)
Re: reducing isolation tests runtime

On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On the subject of test total time, we could parallelize isolation tests.
Right now "make check" in src/test/isolation takes 1:16 on my machine.
Test "timeouts" takes full 40s of that, with nothing running in parallel
-- the machine is completely idle.

Seems like we can have a lot of time back just by changing the schedule
to use multiple tests per line (in particular, put the other slow tests
together with timeouts), per the attached; with this new schedule,
isolation takes 44 seconds on my machine -- a win of 32 seconds. We can
win a couple of additional seconds by grouping a few other lines, but
this is the biggest win.

(This needs to be adjusted because some table names in the specs
conflict.)

Oh, cool. Yes, the time the isolation tests take to run is quite
annoying. I didn't realize it would be so easy to run it in parallel.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#68)
Re: reducing isolation tests runtime

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jan 24, 2018 at 6:10 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On the subject of test total time, we could parallelize isolation tests.

Oh, cool. Yes, the time the isolation tests take to run is quite
annoying. I didn't realize it would be so easy to run it in parallel.

+1 to both --- I hadn't realized we had enough infrastructure to do this
in parallel, either.

regards, tom lane

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#66)
Re: pgsql: Add parallel-aware hash joins.

Andres Freund <andres@anarazel.de> writes:

On 2018-01-22 23:17:47 +1300, Thomas Munro wrote:

Here is a patch that halves the size of the test tables used.
... Does this pass repeatedly on gaur?

I'd say, let's just commit it and see?

Oh, sorry, I forgot I was on the hook to check that. The news isn't
good: I got one pass, and then this on the second iteration:

*** /home/postgres/pgsql/src/test/regress/expected/join.out     Wed Jan 24 18:07:42 2018
--- /home/postgres/pgsql/src/test/regress/results/join.out      Wed Jan 24 18:26:53 2018
***************
*** 6122,6128 ****
  $$);
   initially_multibatch | increased_batches 
  ----------------------+-------------------
!  t                    | f
  (1 row)
  rollback to settings;
--- 6122,6128 ----
  $$);
   initially_multibatch | increased_batches 
  ----------------------+-------------------
!  t                    | t
  (1 row)

rollback to settings;

======================================================================

regards, tom lane

#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#67)
Re: reducing isolation tests runtime

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On the subject of test total time, we could parallelize isolation tests.
Right now "make check" in src/test/isolation takes 1:16 on my machine.
Test "timeouts" takes full 40s of that, with nothing running in parallel
-- the machine is completely idle.

BTW, one small issue there is that the reason the timeouts test is so
slow is that we have to use multi-second timeouts to be sure slower
buildfarm critters (eg valgrind animals) will get the expected results.
So I'm worried that if the machine isn't otherwise idle, we will get
random failures.

We could parallelize the rest of those tests and leave timeouts in its own
group. That cuts the payback a lot :-( but might still be worth doing.
Or maybe tweak things so that the buildfarm runs a serial schedule but
manual testing doesn't. Or we could debate how important the timeout
tests really are ... or think harder about how to make them reproducible.

regards, tom lane

#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#64)
Re: pgsql: Add parallel-aware hash joins.

Robert Haas <robertmhaas@gmail.com> writes:

There is no need to collect years of data in order to tell whether or
not the time to run the tests has increased by as much on developer
machines as it has on prairiedog. You showed the time going from 3:36
to 8:09 between 2014 and the present. That is a 2.26x increase. It
is obvious from the numbers I posted before that no such increase has
taken place in the time it takes to run 'make check' on my relatively
modern laptop. Whatever difference exists is measured in
milliseconds.

I may be wasting my breath here, but in one more attempt to convince
you that "time make check" on your laptop is not the only number that
anyone should be interested in, here are some timings off my development
workstation. These are timings off current tip of each release branch,
all the same build options etc, so very comparable:

9.4
[ don't remember the equivalent to top-level make temp-install here ]
top-level make check 24.725s
installcheck-parallel 15.383s
installcheck 27.560s

9.5
make temp-install 3.702s
initdb 2.328s
top-level make check 24.709s
installcheck-parallel 16.632s
installcheck 32.427s

9.6
make temp-install 3.971s
initdb 2.178s
top-level make check 24.048s
installcheck-parallel 15.889s
installcheck 32.775s

10
make temp-install 4.051s
initdb 1.363s
top-level make check 21.784s
installcheck-parallel 15.209s
installcheck 31.938s

HEAD
make temp-install 4.048s
initdb 1.361s
top-level make check 24.027s
installcheck-parallel 16.914s
installcheck 35.745s

I copied-and-pasted the "real time" results of time(1) for each of these,
not bothering to round them off; but the numbers are only reproducible
to half a second or so, so there's no significance in the last couple
digits. Most numbers above are the minimum of 2 or more runs.
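As a quick cross-check of the since-v10 trend, the relative change can be computed directly from the figures quoted above (a rough back-of-the-envelope calculation, nothing more):

```python
# Relative slowdown between the v10 and HEAD branch timings posted above.
v10  = {"check": 21.784, "installcheck-parallel": 15.209, "installcheck": 31.938}
head = {"check": 24.027, "installcheck-parallel": 16.914, "installcheck": 35.745}

pct = {k: round(100 * (head[k] - v10[k]) / v10[k], 1) for k in v10}
# pct: {'check': 10.3, 'installcheck-parallel': 11.2, 'installcheck': 11.9}
```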

What I take away here is that there's been a pretty steep cost increase
for the regression tests since v10, and that is *not* in line with the
historical average. In fact, in most years we've bought enough speedup
through performance improvements to pay for the test cases we added.
This is masked if you just eyeball "make check" compared to several years
ago. But to do that, you have to ignore the fact that we made substantial
improvements in the runtime of initdb as well as the regression tests
proper circa v10, and we've now thrown that away and more.

So I remain dissatisfied with these results, particularly because in
my own work habits, the time for "make installcheck-parallel" is way
more interesting than "make check". I avoid redoing installs and
initdbs if I don't need them.

... Even if that meant that you had
to wait 1 extra second every time you run 'make check', I would judge
that worthwhile.

I think this is a bad way of looking at it. Sure, in terms of
one developer doing one test run, a second or two means nothing.
But for instance, if you want to do 100 test runs in hope of catching
a seldom-reproduced bug, it adds up. It also adds up when you consider
the aggregate effort expended by the buildfarm, or the time you have
to wait to see buildfarm results.

Another thing you could do is consider applying the patch Thomas
already posted to reduce the size of the tables involved.

Yeah. What I thought this argument was about was convincing *you*
that that would be a reasonable patch to apply. It seems from my
experiment on gaur that that patch makes the results unstable, so
if we can do it at all it will need more work. But I do think
it's worth putting in some more sweat here. In the long run the
time savings will add up.

regards, tom lane

#73Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#72)
Re: pgsql: Add parallel-aware hash joins.

On Wed, Jan 24, 2018 at 11:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I may be wasting my breath here, but in one more attempt to convince
you that "time make check" on your laptop is not the only number that
anyone should be interested in, ...

Now that is not what I said, or at least not what I intended to say.
I'm taking the position that what happens on a developer laptop is
more relevant than what happens on an ancient buildfarm critter. I am
NOT taking the position that my particular laptop or even developer
laptops generally are the *only* thing that matters. I gave the
numbers from my laptop because it's the one I can test. I cannot
easily test yours.

What I take away here is that there's been a pretty steep cost increase
for the regression tests since v10, and that is *not* in line with the
historical average. In fact, in most years we've bought enough speedup
through performance improvements to pay for the test cases we added.
This is masked if you just eyeball "make check" compared to several years
ago. But to do that, you have to ignore the fact that we made substantial
improvements in the runtime of initdb as well as the regression tests
proper circa v10, and we've now thrown that away and more.

OK, I can see some increase there. It's definitely more for you than
it is for me, since you see something like a 10% slowdown between 10
and master and I see basically no difference. I don't know why that
should be, but I'm not doubting you.

So I remain dissatisfied with these results, particularly because in
my own work habits, the time for "make installcheck-parallel" is way
more interesting than "make check". I avoid redoing installs and
initdbs if I don't need them.

I'm not that efficient, but noted.

... Even if that meant that you had
to wait 1 extra second every time you run 'make check', I would judge
that worthwhile.

I think this is a bad way of looking at it. Sure, in terms of
one developer doing one test run, a second or two means nothing.
But for instance, if you want to do 100 test runs in hope of catching
a seldom-reproduced bug, it adds up. It also adds up when you consider
the aggregate effort expended by the buildfarm, or the time you have
to wait to see buildfarm results.

Sure, but as Andres said, you also have to consider how much developer
time it takes to recoup the savings. If it takes a day of development
time to save a second of regression test time, that might be worth it;
if it takes a month, color me doubtful, especially if the result is a
more fragile test that happens to pass on all of the buildfarm
critters we have now but might fail spuriously when somebody adds a
new one.

Yeah. What I thought this argument was about was convincing *you*
that that would be a reasonable patch to apply. It seems from my
experiment on gaur that that patch makes the results unstable, so
if we can do it at all it will need more work. But I do think
it's worth putting in some more sweat here. In the long run the
time savings will add up.

Why me? Thomas wrote the patch, Andres committed it, and you
complained about it. I'm just along for the ride.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#74Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#71)
Re: reducing isolation tests runtime

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On the subject of test total time, we could parallelize isolation tests.
Right now "make check" in src/test/isolation takes 1:16 on my machine.
Test "timeouts" takes full 40s of that, with nothing running in parallel
-- the machine is completely idle.

BTW, one small issue there is that the reason the timeouts test is so
slow is that we have to use multi-second timeouts to be sure slower
buildfarm critters (eg valgrind animals) will get the expected results.
So I'm worried that if the machine isn't otherwise idle, we will get
random failures.

I think we could solve this by putting in the same parallel group only
slow tests that mostly sleep, i.e. nothing that would monopolize the CPU for
long enough to cause a problem. Concretely:

test: timeouts tuplelock-update deadlock-hard deadlock-soft-2

all of these tests have lots of sleeps and don't go through a lot of
data. (Compared to the previous patch, I removed alter-table-1, which
uses a thousand tuples, and multiple-row-versions, which uses 100k; also
removed receipt-report which uses a large number of permutations.)

Timings:
timeouts 40.3s
tuplelock-update 10.5s
deadlock-hard 10.9s
deadlock-soft-2 5.4s

alter-table-1 takes 1.5s, receipt-report 1.2s and there's nothing else
that takes above 1s, so I think this is good enough -- we can still have
the whole thing run in ~45 seconds without the hazard you describe.
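The arithmetic behind that ~45 second figure is just that a parallel schedule line costs as much as its slowest member, while a serial schedule costs the sum. A sketch using the timings quoted above:

```python
# A "test:" line in an isolation schedule runs its members concurrently,
# so the line's wall-clock cost is its slowest test; run serially, the
# same tests cost the sum of their runtimes.
timings = {"timeouts": 40.3, "tuplelock-update": 10.5,
           "deadlock-hard": 10.9, "deadlock-soft-2": 5.4}

serial_cost  = sum(timings.values())   # ~67.1s, one test per line
grouped_cost = max(timings.values())   # 40.3s, all four on one line
```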

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#75Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#74)
Re: reducing isolation tests runtime

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On the subject of test total time, we could parallelize isolation tests.

BTW, one small issue there is that the reason the timeouts test is so
slow is that we have to use multi-second timeouts to be sure slower
buildfarm critters (eg valgrind animals) will get the expected results.
So I'm worried that if the machine isn't otherwise idle, we will get
random failures.

I think we could solve this by putting in the same parallel group only
slow tests that mostly sleep, i.e. nothing that would monopolize the CPU for
long enough to cause a problem. Concretely:
test: timeouts tuplelock-update deadlock-hard deadlock-soft-2

OK, but there'd better be a comment there explaining the concern
very precisely, or somebody will break it.

regards, tom lane

#76Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#75)
1 attachment(s)
Re: reducing isolation tests runtime

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

I think we could solve this by putting in the same parallel group only
slow tests that mostly sleeps, ie. nothing that would monopolize CPU for
long enough to cause a problem. Concretely:
test: timeouts tuplelock-update deadlock-hard deadlock-soft-2

OK, but there'd better be a comment there explaining the concern
very precisely, or somebody will break it.

Here's a concrete proposal. Runtime is 45.7 seconds on my laptop. It
can be further reduced, but not by more than a second or two unless you
get in the business of modifying other tests. (I only modified
deadlock-soft-2 because it saves 5 seconds).

Admittedly the new isolation_schedule file is a bit ugly.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Parallelize-isolation-tests-a-bit.patchtext/plain; charset=us-asciiDownload
From f80a2fd9e3320d033017e17a648ac1ca13396ef0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 25 Jan 2018 17:11:03 -0300
Subject: [PATCH] Parallelize isolation tests a bit

It is possible to run isolation tests in parallel, which saves quite a
bit of runtime (from 76 to 45 seconds in my machine).

Test 'timeouts' has strict timing requirements, which may fail on animals
that are slow (because they run under valgrind, do cache clobbering or
expensive memory checks, or simply have slow hardware), so it is put in
the same group only with other tests that mostly sleep, to avoid
disruption.

Discussion: https://postgr.es/m/20180124231006.z7spaz5gkzbdvob5@alvherre.pgsql
---
 src/test/isolation/expected/deadlock-soft-2.out | 12 ++--
 src/test/isolation/isolation_schedule           | 84 +++++--------------------
 src/test/isolation/specs/deadlock-soft-2.spec   | 22 +++----
 3 files changed, 33 insertions(+), 85 deletions(-)

diff --git a/src/test/isolation/expected/deadlock-soft-2.out b/src/test/isolation/expected/deadlock-soft-2.out
index 14b0343ba4..5c33a5cdaa 100644
--- a/src/test/isolation/expected/deadlock-soft-2.out
+++ b/src/test/isolation/expected/deadlock-soft-2.out
@@ -1,12 +1,12 @@
 Parsed test spec with 4 sessions
 
 starting permutation: s1a s2a s2b s3a s4a s1b s1c s2c s3c s4c
-step s1a: LOCK TABLE a1 IN SHARE UPDATE EXCLUSIVE MODE;
-step s2a: LOCK TABLE a2 IN ACCESS SHARE MODE;
-step s2b: LOCK TABLE a1 IN SHARE UPDATE EXCLUSIVE MODE; <waiting ...>
-step s3a: LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; <waiting ...>
-step s4a: LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; <waiting ...>
-step s1b: LOCK TABLE a2 IN SHARE UPDATE EXCLUSIVE MODE; <waiting ...>
+step s1a: LOCK TABLE aa1 IN SHARE UPDATE EXCLUSIVE MODE;
+step s2a: LOCK TABLE aa2 IN ACCESS SHARE MODE;
+step s2b: LOCK TABLE aa1 IN SHARE UPDATE EXCLUSIVE MODE; <waiting ...>
+step s3a: LOCK TABLE aa2 IN ACCESS EXCLUSIVE MODE; <waiting ...>
+step s4a: LOCK TABLE aa2 IN ACCESS EXCLUSIVE MODE; <waiting ...>
+step s1b: LOCK TABLE aa2 IN SHARE UPDATE EXCLUSIVE MODE; <waiting ...>
 step s1b: <... completed>
 step s1c: COMMIT;
 step s2b: <... completed>
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..0df00c6a53 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -1,68 +1,16 @@
-test: read-only-anomaly
-test: read-only-anomaly-2
-test: read-only-anomaly-3
-test: read-write-unique
-test: read-write-unique-2
-test: read-write-unique-3
-test: read-write-unique-4
-test: simple-write-skew
-test: receipt-report
-test: temporal-range-integrity
-test: project-manager
-test: classroom-scheduling
-test: total-cash
-test: referential-integrity
-test: ri-trigger
-test: partial-index
-test: two-ids
-test: multiple-row-versions
-test: index-only-scan
-test: deadlock-simple
-test: deadlock-hard
-test: deadlock-soft
-test: deadlock-soft-2
-test: fk-contention
-test: fk-deadlock
-test: fk-deadlock2
-test: eval-plan-qual
-test: lock-update-delete
-test: lock-update-traversal
-test: insert-conflict-do-nothing
-test: insert-conflict-do-nothing-2
-test: insert-conflict-do-update
-test: insert-conflict-do-update-2
-test: insert-conflict-do-update-3
-test: insert-conflict-toast
-test: delete-abort-savept
-test: delete-abort-savept-2
-test: aborted-keyrevoke
-test: multixact-no-deadlock
-test: multixact-no-forget
-test: lock-committed-update
-test: lock-committed-keyupdate
-test: update-locked-tuple
-test: propagate-lock-delete
-test: tuplelock-conflict
-test: tuplelock-update
-test: freeze-the-dead
-test: nowait
-test: nowait-2
-test: nowait-3
-test: nowait-4
-test: nowait-5
-test: skip-locked
-test: skip-locked-2
-test: skip-locked-3
-test: skip-locked-4
-test: drop-index-concurrently-1
-test: multiple-cic
-test: alter-table-1
-test: alter-table-2
-test: alter-table-3
-test: alter-table-4
-test: create-trigger
-test: sequence-ddl
-test: async-notify
-test: vacuum-reltuples
-test: timeouts
-test: vacuum-concurrent-drop
+# In the first group of parallel tests, test "timeouts" might be susceptible to
+# additional load placed on the machine, so avoid anything in this group that
+# does much more than sleep.
+test: timeouts tuplelock-update deadlock-hard deadlock-soft-2
+
+test: read-only-anomaly read-write-unique temporal-range-integrity classroom-scheduling nowait-3 skip-locked-3 alter-table-3
+test: read-only-anomaly-2 read-write-unique-2 project-manager total-cash nowait-4 alter-table-4
+test: read-only-anomaly-3 read-write-unique-4 referential-integrity ri-trigger partial-index sequence-ddl aborted-keyrevoke
+test: read-write-unique-3 two-ids multiple-row-versions index-only-scan deadlock-simple async-notify fk-contention
+test: deadlock-soft simple-write-skew fk-deadlock alter-table-1 multiple-cic vacuum-reltuples skip-locked-4 receipt-report
+test: fk-deadlock2 eval-plan-qual lock-update-delete insert-conflict-do-nothing insert-conflict-toast
+test: insert-conflict-do-nothing-2 delete-abort-savept multixact-no-deadlock nowait-5 create-trigger
+test: insert-conflict-do-update lock-committed-update multixact-no-forget lock-committed-keyupdate lock-update-traversal alter-table-2
+test: insert-conflict-do-update-2 delete-abort-savept-2 propagate-lock-delete tuplelock-conflict freeze-the-dead
+test: insert-conflict-do-update-3 update-locked-tuple nowait skip-locked drop-index-concurrently-1
+test: skip-locked-2 vacuum-concurrent-drop nowait-2
diff --git a/src/test/isolation/specs/deadlock-soft-2.spec b/src/test/isolation/specs/deadlock-soft-2.spec
index 46b73bc082..9c643bc026 100644
--- a/src/test/isolation/specs/deadlock-soft-2.spec
+++ b/src/test/isolation/specs/deadlock-soft-2.spec
@@ -1,6 +1,6 @@
 # Soft deadlock requiring reversal of multiple wait-edges.  s1 must
-# jump over both s3 and s4 and acquire the lock on a2 immediately,
-# since s3 and s4 are hard-blocked on a1.
+# jump over both s3 and s4 and acquire the lock on aa2 immediately,
+# since s3 and s4 are hard-blocked on aa1.
 
 # The expected output for this test assumes that isolationtester will
 # detect step s1b as waiting before the deadlock detector runs and
@@ -10,35 +10,35 @@
 
 setup
 {
-  CREATE TABLE a1 ();
-  CREATE TABLE a2 ();
+  CREATE TABLE aa1 ();
+  CREATE TABLE aa2 ();
 }
 
 teardown
 {
-  DROP TABLE a1, a2;
+  DROP TABLE aa1, aa2;
 }
 
 session "s1"
 setup		{ BEGIN; SET deadlock_timeout = '5s'; }
-step "s1a"	{ LOCK TABLE a1 IN SHARE UPDATE EXCLUSIVE MODE; }
-step "s1b"	{ LOCK TABLE a2 IN SHARE UPDATE EXCLUSIVE MODE; }
+step "s1a"	{ LOCK TABLE aa1 IN SHARE UPDATE EXCLUSIVE MODE; }
+step "s1b"	{ LOCK TABLE aa2 IN SHARE UPDATE EXCLUSIVE MODE; }
 step "s1c"	{ COMMIT; }
 
 session "s2"
 setup		{ BEGIN; SET deadlock_timeout = '100s'; }
-step "s2a" 	{ LOCK TABLE a2 IN ACCESS SHARE MODE; }
-step "s2b" 	{ LOCK TABLE a1 IN SHARE UPDATE EXCLUSIVE MODE; }
+step "s2a" 	{ LOCK TABLE aa2 IN ACCESS SHARE MODE; }
+step "s2b" 	{ LOCK TABLE aa1 IN SHARE UPDATE EXCLUSIVE MODE; }
 step "s2c"	{ COMMIT; }
 
 session "s3"
 setup		{ BEGIN; SET deadlock_timeout = '100s'; }
-step "s3a"	{ LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; }
+step "s3a"	{ LOCK TABLE aa2 IN ACCESS EXCLUSIVE MODE; }
 step "s3c"	{ COMMIT; }
 
 session "s4"
 setup		{ BEGIN; SET deadlock_timeout = '100s'; }
-step "s4a"	{ LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; }
+step "s4a"	{ LOCK TABLE aa2 IN ACCESS EXCLUSIVE MODE; }
 step "s4c"	{ COMMIT; }
 
 permutation "s1a" "s2a" "s2b" "s3a" "s4a" "s1b" "s1c" "s2c" "s3c" "s4c"
-- 
2.11.0

#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#76)
Re: reducing isolation tests runtime

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Here's a concrete proposal. Runtime is 45.7 seconds on my laptop. It
can be further reduced, but not by more than a second or two unless you
get in the business of modifying other tests. (I only modified
deadlock-soft-2 because it saves 5 seconds).

Looks reasonable to me, but do we want to set any particular convention
about the max number of tests to run in parallel? If so there should
be at least a comment saying what.

Admittedly the new isolation_schedule file is a bit ugly.

Meh, seems fine.

We won't know if this really works till it hits the buildfarm,
I suppose.

regards, tom lane

#78Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Tom Lane (#77)
Re: reducing isolation tests runtime

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Here's a concrete proposal. Runtime is 45.7 seconds on my laptop. It
can be further reduced, but not by more than a second or two unless you
get in the business of modifying other tests. (I only modified
deadlock-soft-2 because it saves 5 seconds).

Looks reasonable to me, but do we want to set any particular convention
about the max number of tests to run in parallel? If so there should
be at least a comment saying what.

Hmm, I ran this with a limited number of connections and found that it
fails with fewer than 27; and there's no MAX_CONNECTIONS like there is
for pg_regress. So I'll put this back on the drawing board until I'm
back from vacations ...
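The connection-count constraint is easy to model: each spec's isolationtester opens one connection per session plus a control connection, so a parallel line needs the sum across its members. The session counts below are illustrative placeholders, not taken from the actual spec files:

```python
# Connections needed by one parallel schedule line, assuming each spec's
# isolationtester opens one connection per session plus one control
# connection.  Session counts here are illustrative, not real spec data.
sessions = {"timeouts": 2, "tuplelock-update": 3,
            "deadlock-hard": 8, "deadlock-soft-2": 4}

def connections_needed(line):
    return sum(sessions[spec] + 1 for spec in line)

needed = connections_needed(list(sessions))   # 21 with these counts
```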

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#79Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#78)
Re: reducing isolation tests runtime

Hi,

On 2018-01-25 18:27:28 -0300, Alvaro Herrera wrote:

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Here's a concrete proposal. Runtime is 45.7 seconds on my laptop. It
can be further reduced, but not by more than a second or two unless you
get in the business of modifying other tests. (I only modified
deadlock-soft-2 because it saves 5 seconds).

Looks reasonable to me, but do we want to set any particular convention
about the max number of tests to run in parallel? If so there should
be at least a comment saying what.

Hmm, I ran this with a limited number of connections and found that it
fails with fewer than 27; and there's no MAX_CONNECTIONS like there is
for pg_regress. So I'll put this back on the drawing board until I'm
back from vacations ...

I'd like to see this revived, getting a bit tired waiting longer and
longer to see isolationtester complete. Is it really a problem that we
require a certain number of connections? Something on the order of 30-50
connections ought not to be a real problem for realistic machines, and
if it's a problem for one, they can use a serialized schedule?

Greetings,

Andres Freund

#80Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#79)
Re: reducing isolation tests runtime

On 2018-Dec-04, Andres Freund wrote:

Hi,

I'd like to see this revived, getting a bit tired waiting longer and
longer to see isolationtester complete. Is it really a problem that we
require a certain number of connections? Something on the order of 30-50
connections ought not to be a real problem for realistic machines, and
if it's a problem for one, they can use a serialized schedule?

Hello

Yeah, me too. Let me see about it.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#81Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#79)
Re: reducing isolation tests runtime

Andres Freund <andres@anarazel.de> writes:

I'd like to see this revived, getting a bit tired waiting longer and
longer to see isolationtester complete. Is it really a problem that we
require a certain number of connections? Something on the order of 30-50
connections ought not to be a real problem for realistic machines, and
if it's a problem for one, they can use a serialized schedule?

The longstanding convention in the main regression tests is 20 max.
Is there a reason to be different here?

regards, tom lane

#82Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#81)
Re: reducing isolation tests runtime

Hi,

On 2018-12-04 15:45:39 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I'd like to see this revived, getting a bit tired waiting longer and
longer to see isolationtester complete. Is it really a problem that we
require a certain number of connections? Something on the order of 30-50
connections ought not to be a real problem for realistic machines, and
if it's a problem for one, they can use a serialized schedule?

The longstanding convention in the main regression tests is 20 max.
Is there a reason to be different here?

It's a bit less obvious from the outside how many connections a test
spawns - IOW, it might be easier to maintain the schedule if the cap
isn't as tight. And I'm doubtful that there's a good reason for the 20
limit these days, so going a bit higher seems reasonable.

Greetings,

Andres Freund

#83Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#76)
Re: reducing isolation tests runtime

Hi,

On 2018-01-25 17:34:15 -0300, Alvaro Herrera wrote:

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

I think we could solve this by putting in the same parallel group only
slow tests that mostly sleeps, ie. nothing that would monopolize CPU for
long enough to cause a problem. Concretely:
test: timeouts tuplelock-update deadlock-hard deadlock-soft-2

OK, but there'd better be a comment there explaining the concern
very precisely, or somebody will break it.

Here's a concrete proposal. Runtime is 45.7 seconds on my laptop. It
can be further reduced, but not by more than a second or two unless you
get in the business of modifying other tests. (I only modified
deadlock-soft-2 because it saves 5 seconds).

I'm working on an updated version of this. Adding the new tests is a bit
painful because conflicting names make it harder than necessary to
schedule tests. While it's possible to work out a schedule that doesn't
conflict, it's pretty annoying to do and, more importantly, seems fragile
- it's very easy to create schedules that succeed on one machine and
fail on another, depending on which tests are slow.

I'm more inclined to be a bit more aggressive in renaming tables -
there's not much point in having a lot of "foo"s around. So I'm
inclined to rename some of the names that are more likely to
conflict. If we agree on doing that, I'd like to do that first, and
commit that more aggressively than the schedule itself.

An alternative approach would be to have isolationtester automatically
create a schema with the specfile's name, and place it in the search
path. But that'd make it impossible to use isolationtester against a
standby - which I think we currently don't do, but which probably would
be a good idea.

With regard to the schedule, I'm inclined to order it so that faster
test groups are earlier on, just to make it more likely to reach the
tests one is debugging faster. Does that sound sane?

Do we want to maintain a serial version of the schedule too? I'm
wondering if we should just generate both the isolationtester and plain
regression test serial schedules, either by adding an option to pg_regress
that serializes test groups, or by generating the serial schedule file in
a few lines of perl.
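
[The "few lines" of schedule serialization suggested above could look
roughly like this - a hypothetical standalone helper, not existing
pg_regress code, assuming only the schedule file syntax where a parallel
group is written as "test: a b c":]

```python
# Sketch: turn a pg_regress-style parallel schedule into a serial one
# by splitting each "test: a b c" group into one single-test line per
# name, passing comments and blank lines through unchanged.
def serialize_schedule(lines):
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("test:"):
            for name in stripped[len("test:"):].split():
                out.append(f"test: {name}")
        else:
            out.append(line.rstrip("\n"))  # comments, blank lines, etc.
    return out
```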

Greetings,

Andres Freund

#84Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#83)
Re: reducing isolation tests runtime

Andres Freund <andres@anarazel.de> writes:

I'm working on an updated version of this. Adding the new tests is a bit
painful because conflicting names make it harder than necessary to
schedule tests. While it's possible to work out a schedule that doesn't
conflict, it's pretty annoying to do and, more importantly, seems fragile
- it's very easy to create schedules that succeed on one machine and
fail on another, depending on which tests are slow.

I'm more inclined to be a bit more aggressive in renaming tables -
there's not much point in having a lot of "foo"s around. So I'm
inclined to rename some of the names that are more likely to
conflict. If we agree on doing that, I'd like to do that first, and
commit that more aggressively than the schedule itself.

+1

Do we want to maintain a serial version of the schedule too?

Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
the load on their hosts. As long as the isolation tests honor that,
I don't see a real need for a separate serial schedule.

(We've talked about retiring the serial sched for the main regression
tests, and while that trigger's not been pulled yet, I think it's
just a matter of time. So making the isolation tests follow that
precedent seems wrong anyway.)

regards, tom lane

#85Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#84)
Re: reducing isolation tests runtime

On 2019-Feb-13, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I'm working on an updated version of this. Adding the new tests is a bit
painful because conflicting names make it harder than necessary to
schedule tests. While it's possible to work out a schedule that doesn't
conflict, it's pretty annoying to do and, more importantly, seems fragile
- it's very easy to create schedules that succeed on one machine and
fail on another, depending on which tests are slow.

I'm more inclined to be a bit more aggressive in renaming tables -
there's not much point in having a lot of "foo"s around. So I'm
inclined to rename some of the names that are more likely to
conflict. If we agree on doing that, I'd like to do that first, and
commit that more aggressively than the schedule itself.

+1

+1

(Using separate schemas sounds a useful idea if we accumulate dozens of
tests, so I suggest that we do that for future tests, but for the time
being I wouldn't bother.)

Do we want to maintain a serial version of the schedule too?

Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
the load on their hosts. As long as the isolation tests honor that,
I don't see a real need for a separate serial schedule.

MAX_CONNECTIONS was the only reason I didn't push this through. Do you
(Andres) have any solution to that?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#86Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#85)
Re: reducing isolation tests runtime

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2019-Feb-13, Tom Lane wrote:

Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
the load on their hosts. As long as the isolation tests honor that,
I don't see a real need for a separate serial schedule.

MAX_CONNECTIONS was the only reason I didn't push this through. Do you
(Andres) have any solution to that?

Doesn't the common pg_regress.c infrastructure handle that?
We might need to improve isolation_main.c and/or the isolation
Makefile to make it accessible.

I suppose that in what I'm thinking about, MAX_CONNECTIONS would be
interpreted as "max number of concurrent isolation scripts", which
is not exactly number of connections. A quick and dirty answer
would be to have isolation_main.c divide the limit by a factor of 4
or so.
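
[The quick-and-dirty translation suggested above could be sketched like
this - a toy illustration, not isolation_main.c code, with the
4-sessions-per-spec factor being a guess rather than a measured value:]

```python
# Sketch: treat MAX_CONNECTIONS as a backend-connection budget and
# derive a cap on concurrently running isolation specs by dividing
# by an assumed number of sessions per spec.
def max_concurrent_specs(max_connections, sessions_per_spec=4):
    if max_connections <= 0:
        return 0  # pg_regress convention: 0 means "no limit"
    return max(1, max_connections // sessions_per_spec)
```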

regards, tom lane

#87Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#86)
Re: reducing isolation tests runtime

On 2019-02-13 10:58:50 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2019-Feb-13, Tom Lane wrote:

Some of the slower buildfarm critters use MAX_CONNECTIONS to limit
the load on their hosts. As long as the isolation tests honor that,
I don't see a real need for a separate serial schedule.

MAX_CONNECTIONS was the only reason I didn't push this through. Do you
(Andres) have any solution to that?

Doesn't the common pg_regress.c infrastructure handle that?
We might need to improve isolation_main.c and/or the isolation
Makefile to make it accessible.

I suppose that in what I'm thinking about, MAX_CONNECTIONS would be
interpreted as "max number of concurrent isolation scripts", which
is not exactly number of connections. A quick and dirty answer
would be to have isolation_main.c divide the limit by a factor of 4
or so.

I guess that could work, although it's certainly not too pretty.
Alternatively we could pre-parse the spec files, but that's a bit
annoying given that isolationtester.c is a separate C file...

Do you have an idea why we have both max_concurrent_tests *and*
max_connections in pg_regress? ISTM the former isn't really useful given
the latter?

Greetings,

Andres Freund

#88Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#87)
Re: reducing isolation tests runtime

Andres Freund <andres@anarazel.de> writes:

Do you have an idea why we have both max_concurrent_tests *and*
max_connections in pg_regress? ISTM the former isn't really useful given
the latter?

No, the former is a static restriction on what the schedule file is
allowed to contain, the latter is a dynamic restriction (that typically
is unlimited anyway).

regards, tom lane

#89Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#88)
Re: reducing isolation tests runtime

Hi,

On 2019-02-13 12:41:41 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Do you have an idea why we have both max_concurrent_tests *and*
max_connections in pg_regress? ISTM the former isn't really useful given
the latter?

No, the former is a static restriction on what the schedule file is
allowed to contain, the latter is a dynamic restriction (that typically
is unlimited anyway).

Right, but why don't we allow for more tests in a group, and then use a
default max_connections to limit concurrency? Having larger groups is
advantageous wrt test runtime - it reduces the number of artificial
serialization points where the slowest test slows things down. Obviously
there are still a few groups that are needed for test interdependency
management, but that's comparatively rare. We have plenty of groups
that are just broken up to stay below max_concurrent_tests.
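
[A toy model of that argument, with made-up durations and cap: with
fixed groups, each group runs for as long as its slowest member, while
one large group capped by max_connections lets short tests backfill
idle slots:]

```python
import heapq

def grouped_runtime(groups):
    # Each group is a list of test durations; groups run one after
    # another, and a group finishes only when its slowest test does.
    return sum(max(group) for group in groups)

def pooled_runtime(durations, cap):
    # Greedy longest-first assignment of tests to `cap` concurrent
    # slots: always hand the next test to the slot that frees up first.
    slots = [0.0] * cap
    heapq.heapify(slots)
    for d in sorted(durations, reverse=True):
        t = heapq.heappop(slots)
        heapq.heappush(slots, t + d)
    return max(slots)
```

With two groups of [10, 1, 1] each, grouped scheduling takes 20 time
units, while the same six tests in one pool of 3 slots finish in 10.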

Greetings,

Andres Freund

#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#89)
Re: reducing isolation tests runtime

Andres Freund <andres@anarazel.de> writes:

Right, but why don't we allow for more tests in a group, and then use a
default max_connections to limit concurrency? Having larger groups is
advantageous wrt test runtime - it reduces the number of artificial
serialization points where the slowest test slows things down. Obviously
there are still a few groups that are needed for test interdependency
management, but that's comparatively rare. We have plenty of groups
that are just broken up to stay below max_concurrent_tests.

Meh. That would also greatly increase the scope for hard-to-reproduce
conflicts between concurrent tests. I'm not especially excited about
going there.

regards, tom lane