parallel joins, and better parallel explain

Started by Robert Haas about 10 years ago · 32 messages

#1 Robert Haas <robertmhaas@gmail.com>
1 attachment(s)

Attached find a patch that does (mostly) two things. First, it allows
the optimizer to generate plans where a Nested Loop or Hash Join
appears below a Gather node. This is a big improvement on what we
have today, where only a sequential scan can be parallelized; with
this patch, entire join problems can be parallelized, as long as they
don't need a Merge Join (see below for more on this). Second, it
improves the output of EXPLAIN when parallel workers are used. With
this patch, EXPLAIN (ANALYZE, VERBOSE) displays not only totals for
all workers, as it does currently in master, but also per-worker
statistics. Putting these two things together, you can get spiffy
stuff like this - thanks to Thom Brown for the query and sample data:

rhaas=# explain (analyze, verbose, costs off) SELECT count(*) FROM
contacts NATURAL JOIN countries WHERE continent = 'Africa';
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=602.527..602.527 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=0.243..531.129 rows=1185951 loops=1)
         Number of Workers: 2
         ->  Hash Join (actual time=0.206..396.106 rows=395317 loops=3)
               Hash Cond: (contacts.country = countries.country)
               Worker 0: actual time=0.260..485.785 rows=486461 loops=1
               Worker 1: actual time=0.260..483.459 rows=480065 loops=1
               ->  Parallel Seq Scan on public.contacts (actual time=0.034..143.824 rows=1666667 loops=3)
                     Output: contacts.id, contacts.first_name, contacts.last_name, contacts.age, contacts.country
                     Worker 0: actual time=0.035..176.784 rows=2051492 loops=1
                     Worker 1: actual time=0.038..174.175 rows=2021506 loops=1
               ->  Hash (actual time=0.064..0.064 rows=59 loops=3)
                     Output: countries.country
                     Buckets: 1024  Batches: 1  Memory Usage: 11kB
                     Worker 0: actual time=0.070..0.070 rows=59 loops=1
                     Worker 1: actual time=0.069..0.069 rows=59 loops=1
                     ->  Seq Scan on public.countries (actual time=0.019..0.051 rows=59 loops=3)
                           Output: countries.country
                           Filter: (countries.continent = 'Africa'::text)
                           Rows Removed by Filter: 190
                           Worker 0: actual time=0.025..0.056 rows=59 loops=1
                           Worker 1: actual time=0.025..0.058 rows=59 loops=1
 Planning time: 0.247 ms
 Execution time: 603.285 ms
(25 rows)

The general theory of operation of this patch is that a Parallel Seq
Scan can be thought of as a "partial" path - that is, it can be run in
multiple workers and will produce part of the results in each worker;
when Gather is performed on those results, we get a complete result
set. For reasons that should be fairly clear on short reflection, a
join between a partial path for one of the two relations and an
ordinary path for the other produces a partial path for the result;
joining two partial paths would produce wrong answers. Thus, we
proceed by generating partial paths for each baserel and joinrel,
which can either be gathered to employ parallelism at that level, or
used to build partial paths for higher-level joinrels which can then
be gathered in turn.
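
In plan-shape terms, the example above composes like this (a schematic,
not literal EXPLAIN output):

Gather                                <- assembles the complete result set
-> Hash Join                          <- partial: one copy per participant
   -> Parallel Seq Scan on contacts   <- partial: each copy scans a subset
   -> Hash
      -> Seq Scan on countries        <- ordinary: built in full everywhere

Joining a partial path to an ordinary path keeps the result partial,
because every row of the ordinary side is visible to every participant.
Joining two partial paths would instead pair each worker's subset of one
relation only against its own subset of the other, silently losing join
matches.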

As mentioned above, this patch doesn't try to generate Merge Join
paths at present. That could be changed, but the plans would probably
not be very good in most cases; the problem is of course that only one
of the two paths can be partial. So, if you had a sort on both sides,
each worker would have to sort part of one relation (which sounds
fine) and all of the other one (which sounds bad). You might
conceivably win with a Sort on one side and an Index Scan on the other
side, but probably not very often. The Gather at the top of the plan
tree is order-destroying, so it doesn't even help for the merge
ordering to match the final query ordering. I'll put some code in to
try partial Merge Join plans if there is a big hue and cry for it, but
personally my feeling is that it would be smarter to forget it for now
and write the code once we have some better options on the executor
side. See the "parallelism + sorting" thread for some discussion of
what kinds of executor nodes would be useful in being able to generate
better parallel merge joins. Some of that work might even get done in
time for 9.6, which would be nice.
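
For concreteness, the kind of partial Merge Join plan being ruled out here
would look something like this (hypothetical - the patch doesn't generate
it; big1 and big2 are stand-in table names):

Gather
-> Merge Join                         <- partial
   -> Sort                            <- sorts this worker's subset: fine
      -> Parallel Seq Scan on big1
   -> Sort                            <- re-sorts all of big2 in every
      -> Seq Scan on big2                worker: bad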

I thought for a while that I might need some code to prevent the
generation of parallel plans in some cases that involve volatile
functions. For example, consider this query:

select (length(continent) - 3) % 10, count(*) from contacts natural
join countries where (length(continent) - 3) % 10 =
substr(timeofday()::text, 23, 1)::integer group by 1;

Without parallelism, this gets evaluated (on my machine, anyway) as a
hash join with countries on the inner side; the volatile but
parallel-safe filter condition is applied to the seq scan that feeds
the hash join. Thus the decision as to whether each row of the
countries table is in or out gets made just once. If this gets run
with a parallel hash join, each worker will make its own decision
about which rows to include in the hash table, and they probably won't
all make the same decisions. This means that the parallel hash join
could return a combination of rows that could never be returned by any
serial plan. That sounds bad, until you realize that if you disable
hash and merge joins and materialization, this can also be run as a
nested loop plan, in which case - if you can persuade the optimizer to
put the countries table on the inner side of the join - you can get
the executor to evaluate the filter condition on countries once for
every row in the contacts table, and of course there's nothing at all
that will make it give the same answer for each row each time.
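
For anyone who wants to reproduce that serial nested loop, the switches in
question are the usual ones:

SET enable_hashjoin = off;
SET enable_mergejoin = off;
SET enable_material = off;

plus whatever additional persuasion it takes to get countries onto the
inner side of the join.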

After mulling it over a bit and studying the documentation, I realized
that marking a function "volatile" guarantees that it won't be
executed *less than once per base table row*. The use of a parallel
plan here doesn't violate that rule. "volatile" never guarantees that
the function won't be evaluated *more than once per base table row*.
So now I think this is all fine. If we're in a serial plan, a
volatile filter condition will get evaluated as little as once per
row, but maybe as many times as some nested loop around that scan
executes. In a parallel plan, it's the same thing, except that the
number of loops might be based on the number of parallel workers
rather than the number of rows in some outer table. That doesn't seem
like an important distinction.
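
(The EXPLAIN output above shows the same effect in miniature: the Seq Scan
on countries reports rows=59 with 190 removed by the filter and loops=3,
i.e. the filter condition was evaluated over all 249 rows once per
participant, three times in total, rather than just once.)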

This patch expands on and subsumes the patch I posted on the Parallel
Append thread. It therefore also does what was discussed over there:
pulls up Gather on top of any Append nodes generated by inheritance
expansion, which is a good idea for all the reasons already discussed
on that thread. Also as discussed on that thread, the cost model here
is still not particularly smart and probably needs improvement in a
variety of ways. Amit Kapila and others at EnterpriseDB will be
looking further into that, and I hope for more feedback from other
interested parties as well, but I don't think this patch needs to fix
everything that's wrong with parallel costing itself. Even
identifying those problems will probably take a fair amount of time
and research, and not having this functionality will not make finding
those problems any easier - probably the opposite.
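
For reference, the cost model in the attached patch is about as primitive
as they come: a partial path's row count is simply divided by
(parallel_degree + 0.5), on the theory that the leader does about half as
much work as each worker. So with 2 workers, a scan expected to produce 5
million rows is costed as though each participant handles 5000000 / 2.5 =
2000000 of them.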

The EXPLAIN portion of the patch should perhaps be separated out and
reviewed/committed separately. I developed it together because I found
the current EXPLAIN output inadequate for understanding how work
was divided up between the leader and workers. The EXPLAIN changes
made it possible to figure that out. More improvements are possible,
but I think this is a big step up over what we have now, so I'm
anxious to press forward with it. Let me know if it's helpful to
split this part out for separate review; I've left it together for now
both because it makes testing the rest of the patch easier and because
it's 9:30pm on the day before Thanksgiving. (Happy Thanksgiving, BTW,
if you live someplace where that's a thing, as I do!)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-join-v1.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 183d3d9..12dae77 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -103,6 +103,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
 						   PlanState *planstate, ExplainState *es);
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir,
 						ExplainState *es);
 static void ExplainScanTarget(Scan *plan, ExplainState *es);
@@ -1437,108 +1438,73 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	/* Show buffer usage */
 	if (es->buffers && planstate->instrument)
+		show_buffer_usage(es, &planstate->instrument->bufusage);
+
+	/* Show worker detail */
+	if (es->analyze && es->verbose && planstate->worker_instrument)
 	{
-		const BufferUsage *usage = &planstate->instrument->bufusage;
+		WorkerInstrumentation *w = planstate->worker_instrument;
+		bool		opened_group = false;
+		int			n;
 
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+		for (n = 0; n < w->num_workers; ++n)
 		{
-			bool		has_shared = (usage->shared_blks_hit > 0 ||
-									  usage->shared_blks_read > 0 ||
-									  usage->shared_blks_dirtied > 0 ||
-									  usage->shared_blks_written > 0);
-			bool		has_local = (usage->local_blks_hit > 0 ||
-									 usage->local_blks_read > 0 ||
-									 usage->local_blks_dirtied > 0 ||
-									 usage->local_blks_written > 0);
-			bool		has_temp = (usage->temp_blks_read > 0 ||
-									usage->temp_blks_written > 0);
-			bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
-								 !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+			Instrumentation *instrument = &w->instrument[n];
+			double		nloops = instrument->nloops;
+			double		startup_sec;
+			double		total_sec;
+			double		rows;
+
+			if (nloops <= 0)
+				continue;
+			startup_sec = 1000.0 * instrument->startup / nloops;
+			total_sec = 1000.0 * instrument->total / nloops;
+			rows = instrument->ntuples / nloops;
 
-			/* Show only positive counter values. */
-			if (has_shared || has_local || has_temp)
+			if (es->format == EXPLAIN_FORMAT_TEXT)
 			{
 				appendStringInfoSpaces(es->str, es->indent * 2);
-				appendStringInfoString(es->str, "Buffers:");
-
-				if (has_shared)
-				{
-					appendStringInfoString(es->str, " shared");
-					if (usage->shared_blks_hit > 0)
-						appendStringInfo(es->str, " hit=%ld",
-										 usage->shared_blks_hit);
-					if (usage->shared_blks_read > 0)
-						appendStringInfo(es->str, " read=%ld",
-										 usage->shared_blks_read);
-					if (usage->shared_blks_dirtied > 0)
-						appendStringInfo(es->str, " dirtied=%ld",
-										 usage->shared_blks_dirtied);
-					if (usage->shared_blks_written > 0)
-						appendStringInfo(es->str, " written=%ld",
-										 usage->shared_blks_written);
-					if (has_local || has_temp)
-						appendStringInfoChar(es->str, ',');
-				}
-				if (has_local)
+				appendStringInfo(es->str, "Worker %d: ", n);
+				if (es->timing)
+					appendStringInfo(es->str,
+							"actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
+								 startup_sec, total_sec, rows, nloops);
+				else
+					appendStringInfo(es->str,
+									 "actual rows=%.0f loops=%.0f\n",
+									 rows, nloops);
+				es->indent++;
+				if (es->buffers)
+					show_buffer_usage(es, &instrument->bufusage);
+				es->indent--;
+			}
+			else
+			{
+				if (!opened_group)
 				{
-					appendStringInfoString(es->str, " local");
-					if (usage->local_blks_hit > 0)
-						appendStringInfo(es->str, " hit=%ld",
-										 usage->local_blks_hit);
-					if (usage->local_blks_read > 0)
-						appendStringInfo(es->str, " read=%ld",
-										 usage->local_blks_read);
-					if (usage->local_blks_dirtied > 0)
-						appendStringInfo(es->str, " dirtied=%ld",
-										 usage->local_blks_dirtied);
-					if (usage->local_blks_written > 0)
-						appendStringInfo(es->str, " written=%ld",
-										 usage->local_blks_written);
-					if (has_temp)
-						appendStringInfoChar(es->str, ',');
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
 				}
-				if (has_temp)
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+
+				if (es->timing)
 				{
-					appendStringInfoString(es->str, " temp");
-					if (usage->temp_blks_read > 0)
-						appendStringInfo(es->str, " read=%ld",
-										 usage->temp_blks_read);
-					if (usage->temp_blks_written > 0)
-						appendStringInfo(es->str, " written=%ld",
-										 usage->temp_blks_written);
+					ExplainPropertyFloat("Actual Startup Time", startup_sec, 3, es);
+					ExplainPropertyFloat("Actual Total Time", total_sec, 3, es);
 				}
-				appendStringInfoChar(es->str, '\n');
-			}
+				ExplainPropertyFloat("Actual Rows", rows, 0, es);
+				ExplainPropertyFloat("Actual Loops", nloops, 0, es);
 
-			/* As above, show only positive counter values. */
-			if (has_timing)
-			{
-				appendStringInfoSpaces(es->str, es->indent * 2);
-				appendStringInfoString(es->str, "I/O Timings:");
-				if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
-					appendStringInfo(es->str, " read=%0.3f",
-							  INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
-				if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
-					appendStringInfo(es->str, " write=%0.3f",
-							 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
-				appendStringInfoChar(es->str, '\n');
+				if (es->buffers)
+					show_buffer_usage(es, &instrument->bufusage);
+
+				ExplainCloseGroup("Worker", NULL, true, es);
 			}
 		}
-		else
-		{
-			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
-			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
-			ExplainPropertyLong("Shared Dirtied Blocks", usage->shared_blks_dirtied, es);
-			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
-			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
-			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
-			ExplainPropertyLong("Local Dirtied Blocks", usage->local_blks_dirtied, es);
-			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
-			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
-			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
-			ExplainPropertyFloat("I/O Read Time", INSTR_TIME_GET_MILLISEC(usage->blk_read_time), 3, es);
-			ExplainPropertyFloat("I/O Write Time", INSTR_TIME_GET_MILLISEC(usage->blk_write_time), 3, es);
-		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
 	}
 
 	/* Get ready to display the child plans */
@@ -2277,6 +2243,113 @@ explain_get_index_name(Oid indexId)
 }
 
 /*
+ * Show buffer usage details.
+ */
+static void
+show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+{
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		bool		has_shared = (usage->shared_blks_hit > 0 ||
+								  usage->shared_blks_read > 0 ||
+								  usage->shared_blks_dirtied > 0 ||
+								  usage->shared_blks_written > 0);
+		bool		has_local = (usage->local_blks_hit > 0 ||
+								 usage->local_blks_read > 0 ||
+								 usage->local_blks_dirtied > 0 ||
+								 usage->local_blks_written > 0);
+		bool		has_temp = (usage->temp_blks_read > 0 ||
+								usage->temp_blks_written > 0);
+		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
+								 !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+
+		/* Show only positive counter values. */
+		if (has_shared || has_local || has_temp)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfoString(es->str, "Buffers:");
+
+			if (has_shared)
+			{
+				appendStringInfoString(es->str, " shared");
+				if (usage->shared_blks_hit > 0)
+					appendStringInfo(es->str, " hit=%ld",
+									 usage->shared_blks_hit);
+				if (usage->shared_blks_read > 0)
+					appendStringInfo(es->str, " read=%ld",
+									 usage->shared_blks_read);
+				if (usage->shared_blks_dirtied > 0)
+					appendStringInfo(es->str, " dirtied=%ld",
+									 usage->shared_blks_dirtied);
+				if (usage->shared_blks_written > 0)
+					appendStringInfo(es->str, " written=%ld",
+									 usage->shared_blks_written);
+				if (has_local || has_temp)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_local)
+			{
+				appendStringInfoString(es->str, " local");
+				if (usage->local_blks_hit > 0)
+					appendStringInfo(es->str, " hit=%ld",
+									 usage->local_blks_hit);
+				if (usage->local_blks_read > 0)
+					appendStringInfo(es->str, " read=%ld",
+									 usage->local_blks_read);
+				if (usage->local_blks_dirtied > 0)
+					appendStringInfo(es->str, " dirtied=%ld",
+									 usage->local_blks_dirtied);
+				if (usage->local_blks_written > 0)
+					appendStringInfo(es->str, " written=%ld",
+									 usage->local_blks_written);
+				if (has_temp)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_temp)
+			{
+				appendStringInfoString(es->str, " temp");
+				if (usage->temp_blks_read > 0)
+					appendStringInfo(es->str, " read=%ld",
+									 usage->temp_blks_read);
+				if (usage->temp_blks_written > 0)
+					appendStringInfo(es->str, " written=%ld",
+									 usage->temp_blks_written);
+			}
+			appendStringInfoChar(es->str, '\n');
+		}
+
+		/* As above, show only positive counter values. */
+		if (has_timing)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfoString(es->str, "I/O Timings:");
+			if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
+				appendStringInfo(es->str, " read=%0.3f",
+						  INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
+			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
+				appendStringInfo(es->str, " write=%0.3f",
+						 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+			appendStringInfoChar(es->str, '\n');
+		}
+	}
+	else
+	{
+		ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+		ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+		ExplainPropertyLong("Shared Dirtied Blocks", usage->shared_blks_dirtied, es);
+		ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+		ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+		ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+		ExplainPropertyLong("Local Dirtied Blocks", usage->local_blks_dirtied, es);
+		ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+		ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+		ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+		ExplainPropertyFloat("I/O Read Time", INSTR_TIME_GET_MILLISEC(usage->blk_read_time), 3, es);
+		ExplainPropertyFloat("I/O Write Time", INSTR_TIME_GET_MILLISEC(usage->blk_write_time), 3, es);
+	}
+}
+
+/*
  * Add some additional details about an IndexScan or IndexOnlyScan
  */
 static void
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 6730037..5d62558 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -49,20 +49,18 @@
 #define PARALLEL_TUPLE_QUEUE_SIZE		65536
 
 /* DSM structure for accumulating per-PlanState instrumentation. */
-typedef struct SharedPlanStateInstrumentation
-{
-	int plan_node_id;
-	slock_t mutex;
-	Instrumentation	instr;
-} SharedPlanStateInstrumentation;
-
-/* DSM structure for accumulating per-PlanState instrumentation. */
 struct SharedExecutorInstrumentation
 {
 	int instrument_options;
-	int ps_ninstrument;			/* # of ps_instrument structures following */
-	SharedPlanStateInstrumentation ps_instrument[FLEXIBLE_ARRAY_MEMBER];
+	int instrument_offset;		/* offset of first Instrumentation struct */
+	int num_workers;							/* # of workers */
+	int num_plan_nodes;							/* # of plan nodes */
+	int plan_node_id[FLEXIBLE_ARRAY_MEMBER];	/* array of plan node IDs */
+	/* array of num_plan_nodes * num_workers Instrumentation objects follows */
 };
+#define GetInstrumentationArray(sei) \
+	(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
+	 (Instrumentation *) (((char *) sei) + sei->instrument_offset))
 
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelEstimateContext
@@ -169,25 +167,25 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	e->nnodes++;
 
 	/* Call estimators for parallel-aware nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanEstimate((SeqScanState *) planstate,
-								e->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanEstimate((SeqScanState *) planstate,
+									e->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
 
 /*
- * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
- * may need to initialize shared state in the DSM before parallel workers
- * are available.  They can allocate the space they previous estimated using
- * shm_toc_allocate, and add the keys they previously estimated using
- * shm_toc_insert, in each case targeting pcxt->toc.
+ * Initialize the dynamic shared memory segment that will be used to control
+ * parallel execution.
  */
 static bool
 ExecParallelInitializeDSM(PlanState *planstate,
@@ -196,31 +194,34 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	if (planstate == NULL)
 		return false;
 
-	/* If instrumentation is enabled, initialize array slot for this node. */
+	/* If instrumentation is enabled, initialize slot for this node. */
 	if (d->instrumentation != NULL)
-	{
-		SharedPlanStateInstrumentation *instrumentation;
-
-		instrumentation = &d->instrumentation->ps_instrument[d->nnodes];
-		Assert(d->nnodes < d->instrumentation->ps_ninstrument);
-		instrumentation->plan_node_id = planstate->plan->plan_node_id;
-		SpinLockInit(&instrumentation->mutex);
-		InstrInit(&instrumentation->instr,
-				  d->instrumentation->instrument_options);
-	}
+		d->instrumentation->plan_node_id[d->nnodes] =
+			planstate->plan->plan_node_id;
 
 	/* Count this node. */
 	d->nnodes++;
 
-	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	/*
+	 * Call initializers for parallel-aware plan nodes.
+	 *
+	 * Ordinary plan nodes won't do anything here, but parallel-aware plan
+	 * nodes may need to initialize shared state in the DSM before parallel
+	 * workers are available.  They can allocate the space they previously
+	 * estimated using shm_toc_allocate, and add the keys they previously
+	 * estimated using shm_toc_insert, in each case targeting pcxt->toc.
+	 */
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeDSM((SeqScanState *) planstate,
-									 d->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
+										 d->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
@@ -307,6 +308,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	int			pstmt_len;
 	int			param_len;
 	int			instrumentation_len = 0;
+	int			instrument_offset;
 
 	/* Allocate object for return value. */
 	pei = palloc0(sizeof(ParallelExecutorInfo));
@@ -364,8 +366,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	if (estate->es_instrument)
 	{
 		instrumentation_len =
-			offsetof(SharedExecutorInstrumentation, ps_instrument)
-			+ sizeof(SharedPlanStateInstrumentation) * e.nnodes;
+			offsetof(SharedExecutorInstrumentation, plan_node_id)
+			+ sizeof(int) * e.nnodes;
+		instrumentation_len = MAXALIGN(instrumentation_len);
+		instrument_offset = instrumentation_len;
+		instrumentation_len += sizeof(Instrumentation) * e.nnodes * nworkers;
 		shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 	}
@@ -407,9 +412,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	 */
 	if (estate->es_instrument)
 	{
+		Instrumentation *instrument;
+		int		i;
+
 		instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
 		instrumentation->instrument_options = estate->es_instrument;
-		instrumentation->ps_ninstrument = e.nnodes;
+		instrumentation->instrument_offset = instrument_offset;
+		instrumentation->num_workers = nworkers;
+		instrumentation->num_plan_nodes = e.nnodes;
+		instrument = GetInstrumentationArray(instrumentation);
+		for (i = 0; i < nworkers * e.nnodes; ++i)
+			InstrInit(&instrument[i], estate->es_instrument);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
 		pei->instrumentation = instrumentation;
@@ -444,20 +457,31 @@ static bool
 ExecParallelRetrieveInstrumentation(PlanState *planstate,
 						  SharedExecutorInstrumentation *instrumentation)
 {
+	Instrumentation *instrument;
 	int		i;
+	int		n;
+	int		ibytes;
 	int		plan_node_id = planstate->plan->plan_node_id;
-	SharedPlanStateInstrumentation *ps_instrument;
 
 	/* Find the instumentation for this node. */
-	for (i = 0; i < instrumentation->ps_ninstrument; ++i)
-		if (instrumentation->ps_instrument[i].plan_node_id == plan_node_id)
+	for (i = 0; i < instrumentation->num_plan_nodes; ++i)
+		if (instrumentation->plan_node_id[i] == plan_node_id)
 			break;
-	if (i >= instrumentation->ps_ninstrument)
+	if (i >= instrumentation->num_plan_nodes)
 		elog(ERROR, "plan node %d not found", plan_node_id);
 
-	/* No need to acquire the spinlock here; workers have exited already. */
-	ps_instrument = &instrumentation->ps_instrument[i];
-	InstrAggNode(planstate->instrument, &ps_instrument->instr);
+	/* Accumulate the statistics from all workers. */
+	instrument = GetInstrumentationArray(instrumentation);
+	instrument += i * instrumentation->num_workers;
+	for (n = 0; n < instrumentation->num_workers; ++n)
+		InstrAggNode(planstate->instrument, &instrument[n]);
+
+	/* Also store the per-worker detail. */
+	ibytes = instrumentation->num_workers * sizeof(Instrumentation);
+	planstate->worker_instrument =
+		palloc(offsetof(WorkerInstrumentation, instrument) + ibytes);
+	planstate->worker_instrument->num_workers = instrumentation->num_workers;
+	memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);
 
 	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
 								 instrumentation);
@@ -568,7 +592,9 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 {
 	int		i;
 	int		plan_node_id = planstate->plan->plan_node_id;
-	SharedPlanStateInstrumentation *ps_instrument;
+	Instrumentation *instrument;
+
+	InstrEndLoop(planstate->instrument);
 
 	/*
 	 * If we shuffled the plan_node_id values in ps_instrument into sorted
@@ -576,20 +602,21 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 	 * if we're pushing down sufficiently large plan trees.  For now, do it
 	 * the slow, dumb way.
 	 */
-	for (i = 0; i < instrumentation->ps_ninstrument; ++i)
-		if (instrumentation->ps_instrument[i].plan_node_id == plan_node_id)
+	for (i = 0; i < instrumentation->num_plan_nodes; ++i)
+		if (instrumentation->plan_node_id[i] == plan_node_id)
 			break;
-	if (i >= instrumentation->ps_ninstrument)
+	if (i >= instrumentation->num_plan_nodes)
 		elog(ERROR, "plan node %d not found", plan_node_id);
 
 	/*
-	 * There's one SharedPlanStateInstrumentation per plan_node_id, so we
-	 * must use a spinlock in case multiple workers report at the same time.
+	 * Add our statistics to the per-node, per-worker totals.  It's possible
+	 * that this could happen more than once if we relaunched workers.
 	 */
-	ps_instrument = &instrumentation->ps_instrument[i];
-	SpinLockAcquire(&ps_instrument->mutex);
-	InstrAggNode(&ps_instrument->instr, planstate->instrument);
-	SpinLockRelease(&ps_instrument->mutex);
+	instrument = GetInstrumentationArray(instrumentation);
+	instrument += i * instrumentation->num_workers;
+	Assert(IsParallelWorker());
+	Assert(ParallelWorkerNumber < instrumentation->num_workers);
+	InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
 
 	return planstate_tree_walker(planstate, ExecParallelReportInstrumentation,
 								 instrumentation);
@@ -607,13 +634,16 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 		return false;
 
 	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 012c14b..9099454 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1588,6 +1588,8 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_BOOL_FIELD(parallel_safe);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1764,7 +1766,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
 
@@ -1887,6 +1888,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 916a518..5019804 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)
 
 
--- bjm & tgl
+Parallel Query and Partial Paths
+--------------------------------
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  Academic literature on parallel query suggests that
+that parallel execution strategies can be divided into essentially two
+categories: pipelined parallelism, where the execution of the query is
+divided into multiple stages and each stage is handled by a separate
+process; and partitioning parallelism, where the data is split between
+multiple processes and each process handles a subset of it.  The
+literature, however, suggests that gains from pipeline parallelism are
+often very limited due to the difficulty of avoiding pipeline stalls.
+Consequently, we do not currently attempt to generate query plans that
+use this technique.
+
+Instead, we focus on partitioning parallelism, which does not require
+that the underlying table be partitioned.  It only requires that (1)
+there is some method of dividing the data from at least one of the base
+tables involved in the relation across multiple processes, (2) allowing
+each process to handle its own portion of the data, and then (3)
+collecting the results.  Requirements (2) and (3) are satisfied by the
+executor node Gather, which launches any number of worker processes and
+executes its single child plan in all of them (and perhaps in the leader
+also, if the children aren't generating enough data to keep the leader
+busy).  Requirement (1) is handled by the SeqScan node: when invoked
+with parallel_aware = true, this node will, in effect, partition the
+table on a block by block basis, returning a subset of the tuples from
+the relation in each worker where that SeqScan is executed.  A similar
+scheme could be (and probably should be) implemented for bitmap heap
+scans.
+
+Just as we do for non-parallel access methods, we build Paths to
+represent access strategies that can be used in a parallel plan.  These
+are, in essence, the same strategies that are available in the
+non-parallel plan, but there is an important difference: a path that
+will run beneath a Gather node returns only a subset of the query
+results in each worker, not all of them.  To form a path that can
+actually be executed, the (rather large) cost of the Gather node must be
+accounted for.  For this reason among others, paths intended to run
+beneath a Gather node - which we call "partial" paths since they return
+only a subset of the results in each worker - must be kept separate from
+ordinary paths (see RelOptInfo's partial_pathlist and the function
+add_partial_path).
+
+One of the keys to making parallel query effective is to run as much of
+the query in parallel as possible.  Therefore, we expect it to generally
+be desirable to postpone the Gather stage until as near to the top of the
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costing them accurately) is likely to keep us
+busy for a long time to come.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1fdcae5..0a11767 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -72,6 +72,7 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 				 Index rti, RangeTblEntry *rte);
 static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
 				   RangeTblEntry *rte);
+static void create_parallel_paths(PlannerInfo *root, RelOptInfo *rel);
 static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
 						  RangeTblEntry *rte);
 static bool function_rte_parallel_ok(RangeTblEntry *rte);
@@ -612,7 +613,6 @@ static void
 set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 {
 	Relids		required_outer;
-	int			parallel_threshold = 1000;
 
 	/*
 	 * We don't support pushing join clauses into the quals of a seqscan, but
@@ -624,39 +624,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer, 0));
 
-	/* Consider parallel sequential scan */
-	if (rel->consider_parallel && rel->pages > parallel_threshold &&
-		required_outer == NULL)
-	{
-		Path *path;
-		int parallel_degree = 1;
-
-		/*
-		 * Limit the degree of parallelism logarithmically based on the size
-		 * of the relation.  This probably needs to be a good deal more
-		 * sophisticated, but we need something here for now.
-		 */
-		while (rel->pages > parallel_threshold * 3 &&
-			   parallel_degree < max_parallel_degree)
-		{
-			parallel_degree++;
-			parallel_threshold *= 3;
-			if (parallel_threshold >= PG_INT32_MAX / 3)
-				break;
-		}
-
-		/*
-		 * Ideally we should consider postponing the gather operation until
-		 * much later, after we've pushed joins and so on atop the parallel
-		 * sequential scan path.  But we don't have the infrastructure for
-		 * that yet, so just do this for now.
-		 */
-		path = create_seqscan_path(root, rel, required_outer, parallel_degree);
-		path = (Path *)
-			create_gather_path(root, rel, path, required_outer,
-							   parallel_degree);
-		add_path(rel, path);
-	}
+	/* If appropriate, consider parallel sequential scan */
+	if (rel->consider_parallel && required_outer == NULL)
+		create_parallel_paths(root, rel);
 
 	/* Consider index scans */
 	create_index_paths(root, rel);
@@ -666,6 +636,54 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * create_parallel_paths
+ *	  Build parallel access paths for a plain relation
+ */
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int		parallel_threshold = 1000;
+	int		parallel_degree = 1;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * without doing anything ... unless it's an inheritance child.  In that case,
+	 * we want to generate a parallel path here anyway.  It might not be worthwhile
+	 * just for this relation, but when combined with all of its inheritance siblings
+	 * it may well pay off.
+	 */
+	if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+		return;
+
+	/*
+	 * Limit the degree of parallelism logarithmically based on the size of the
+	 * relation.  This probably needs to be a good deal more sophisticated, but we
+	 * need something here for now.
+	 */
+	while (rel->pages > parallel_threshold * 3 &&
+		   parallel_degree < max_parallel_degree)
+	{
+		parallel_degree++;
+		parallel_threshold *= 3;
+		if (parallel_threshold >= PG_INT32_MAX / 3)
+			break;
+	}
+
+	/* Add an unordered partial path based on a parallel sequential scan. */
+	add_partial_path(rel, create_seqscan_path(root, rel, NULL, parallel_degree));
+
+	/*
+	 * If this is a baserel, consider gathering any partial paths we may have
+	 * just created.  If we gathered an inheritance child, we could end up with
+	 * a very large number of gather nodes, each trying to grab its own pool of
+	 * workers, so don't do this in that case.  Instead, we'll consider gathering
+	 * partial paths for the appendrel.
+	 */
+	if (rel->reloptkind == RELOPT_BASEREL)
+		generate_gather_paths(root, rel);
+}
+
+/*
  * set_tablesample_rel_size
  *	  Set size estimates for a sampled relation
  */
@@ -1039,6 +1057,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	List	   *live_childrels = NIL;
 	List	   *subpaths = NIL;
 	bool		subpaths_valid = true;
+	List	   *partial_subpaths = NIL;
+	bool		partial_subpaths_valid = true;
 	List	   *all_child_pathkeys = NIL;
 	List	   *all_child_outers = NIL;
 	ListCell   *l;
@@ -1093,6 +1113,13 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		else
 			subpaths_valid = false;
 
+		/* Same idea, but for a partial plan. */
+		if (childrel->partial_pathlist != NIL)
+			partial_subpaths = accumulate_append_subpath(partial_subpaths,
+										linitial(childrel->partial_pathlist));
+		else
+			partial_subpaths_valid = false;
+
 		/*
 		 * Collect lists of all the available path orderings and
 		 * parameterizations for all the children.  We use these as a
@@ -1164,7 +1191,38 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 * if we have zero or one live subpath due to constraint exclusion.)
 	 */
 	if (subpaths_valid)
-		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL));
+		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL, 0));
+
+	/*
+	 * Consider an append of partial unordered, unparameterized partial paths.
+	 */
+	if (partial_subpaths_valid)
+	{
+		AppendPath *appendpath;
+		ListCell *lc;
+		int parallel_degree = 0;
+
+		/*
+		 * Decide what parallel degree to request for this append path.  For
+		 * now, we just use the maximum parallel degree of any member.  It
+		 * might be useful to use a higher number if the Append node were smart
+		 * enough to spread out the workers, but it currently isn't.
+		 */
+		foreach (lc, partial_subpaths)
+		{
+			Path *path = lfirst(lc);
+			parallel_degree = Max(parallel_degree, path->parallel_degree);
+		}
+		Assert(parallel_degree > 0);
+
+		/* Generate a partial append path. */
+		appendpath = create_append_path(rel, partial_subpaths, NULL,
+										parallel_degree);
+		add_partial_path(rel, (Path *) appendpath);
+
+		/* Consider gathering it. */
+		generate_gather_paths(root, rel);
+	}
 
 	/*
 	 * Also build unparameterized MergeAppend paths based on the collected
@@ -1214,7 +1272,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 		if (subpaths_valid)
 			add_path(rel, (Path *)
-					 create_append_path(rel, subpaths, required_outer));
+					 create_append_path(rel, subpaths, required_outer, 0));
 	}
 }
 
@@ -1440,8 +1498,9 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 
 	/* Discard any pre-existing paths; no further need for them */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/*
 	 * We set the cheapest path immediately, to ensure that IS_DUMMY_REL()
@@ -1844,6 +1903,35 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * generate_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather on
+ *		top of a partial path.
+ */
+void
+generate_gather_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	Path   *cheapest_partial_path;
+	Path   *simple_gather_path;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/*
+	 * The output of Gather is currently always unsorted, so there's only one
+	 * partial path of interest: the cheapest one.
+	 *
+	 * Eventually, we should have a Gather Merge operation that can merge multiple
+	 * tuple streams together while preserving their ordering.  We could usefully
+	 * generate such a path from each partial path that has non-NIL pathkeys.
+	 */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+	simple_gather_path = (Path *)
+		create_gather_path(root, rel, cheapest_partial_path, NULL);
+	add_path(rel, simple_gather_path);
+}
+
+/*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..fd9a7da 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -186,8 +186,7 @@ clamp_row_est(double nrows)
  */
 void
 cost_seqscan(Path *path, PlannerInfo *root,
-			 RelOptInfo *baserel, ParamPathInfo *param_info,
-			 int nworkers)
+			 RelOptInfo *baserel, ParamPathInfo *param_info)
 {
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
@@ -205,6 +204,16 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	else
 		path->rows = baserel->rows;
 
+	/*
+	 * Primitive parallel cost model.  Assume the leader will do half as much
+	 * work as a regular worker, because it will also need to read the tuples
+	 * returned by the workers when they percolate up to the gather node.
+	 * This is almost certainly not exactly the right way to model this, so
+	 * this will probably need to be changed at some point...
+	 */
+	if (path->parallel_degree > 0)
+		path->rows = path->rows / (path->parallel_degree + 0.5);
+
 	if (!enable_seqscan)
 		startup_cost += disable_cost;
 
@@ -225,16 +234,6 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 	run_cost += cpu_per_tuple * baserel->tuples;
 
-	/*
-	 * Primitive parallel cost model.  Assume the leader will do half as much
-	 * work as a regular worker, because it will also need to read the tuples
-	 * returned by the workers when they percolate up to the gather ndoe.
-	 * This is almost certainly not exactly the right way to model this, so
-	 * this will probably need to be changed at some point...
-	 */
-	if (nworkers > 0)
-		run_cost = run_cost / (nworkers + 0.5);
-
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index a35c881..0c07622 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -34,6 +34,12 @@ static void sort_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 static void match_unsorted_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
+static void consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra);
 static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
@@ -263,7 +269,12 @@ add_paths_to_joinrel(PlannerInfo *root,
 												 jointype, &extra);
 
 	/*
-	 * 6. Finally, give extensions a chance to manipulate the path list.
+	 * 6. Consider gathering partial paths.
+	 */
+	generate_gather_paths(root, joinrel);
+
+	/*
+	 * 7. Finally, give extensions a chance to manipulate the path list.
 	 */
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
@@ -428,6 +439,62 @@ try_nestloop_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_nestloop_path
+ *	  Consider a partial nestloop join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_nestloop_path(PlannerInfo *root,
+				  RelOptInfo *joinrel,
+				  Path *outer_path,
+				  Path *inner_path,
+				  List *pathkeys,
+				  JoinType jointype,
+				  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * If the inner path is parameterized, the parameterization must be fully
+	 * satisfied by the proposed outer path.  Parameterized partial paths are
+	 * not supported.  The caller should already have verified that no
+	 * extra_lateral_rels are required here.
+	 */
+	Assert(bms_is_empty(extra->extra_lateral_rels));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_nestloop(root, &workspace, jointype,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_nestloop_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  pathkeys,
+								  NULL));
+}
+
+/*
  * try_mergejoin_path
  *	  Consider a merge join path; if it appears useful, push it into
  *	  the joinrel's pathlist via add_path().
@@ -582,6 +649,62 @@ try_hashjoin_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_hashjoin_path
+ *	  Consider a partial hashjoin join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_hashjoin_path(PlannerInfo *root,
+						  RelOptInfo *joinrel,
+						  Path *outer_path,
+						  Path *inner_path,
+						  List *hashclauses,
+						  JoinType jointype,
+						  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * If the inner path is parameterized, the parameterization must be fully
+	 * satisfied by the proposed outer path.  Parameterized partial paths are
+	 * not supported.  The caller should already have verified that no
+	 * extra_lateral_rels are required here.
+	 */
+	Assert(bms_is_empty(extra->extra_lateral_rels));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_empty(inner_paramrels))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, NIL))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_hashjoin_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  NULL,
+								  hashclauses));
+}
+
+/*
  * clause_sides_match_join
  *	  Determine whether a join clause is of the right form to use in this join.
  *
@@ -1173,6 +1296,85 @@ match_unsorted_outer(PlannerInfo *root,
 				break;
 		}
 	}
+
+	/*
+	 * If the joinrel is parallel-safe and the join type supports nested loops,
+	 * we may be able to consider a partial nestloop plan.  However, we can't
+	 * handle JOIN_UNIQUE_OUTER, because the outer path will be partial, and
+	 * therefore we won't be able to properly guarantee uniqueness.  Nor can
+	 * we handle extra_lateral_rels, since partial paths must not be
+	 * parameterized.
+	 */
+	if (joinrel->consider_parallel && nestjoinOK &&
+		save_jointype != JOIN_UNIQUE_OUTER &&
+		bms_is_empty(extra->extra_lateral_rels))
+		consider_parallel_nestloop(root, joinrel, outerrel, innerrel,
+								   save_jointype, extra);
+}
+
+/*
+ * consider_parallel_nestloop
+ *	  Try to build partial paths for a joinrel by joining a partial path for the
+ *	  outer relation to a complete path for the inner relation.
+ *
+ * 'joinrel' is the join relation
+ * 'outerrel' is the outer join relation
+ * 'innerrel' is the inner join relation
+ * 'jointype' is the type of join to do
+ * 'extra' contains additional input values
+ */
+static void
+consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra)
+{
+	ListCell   *lc1;
+
+	foreach(lc1, outerrel->partial_pathlist)
+	{
+		Path	   *outerpath = (Path *) lfirst(lc1);
+		List	   *pathkeys;
+		ListCell   *lc2;
+
+		/* Figure out what useful ordering any paths we create will have. */
+		pathkeys = build_join_pathkeys(root, joinrel, jointype,
+									   outerpath->pathkeys);
+
+		/*
+		 * Try the cheapest parameterized paths; only those which will
+		 * produce an unparameterized path when joined to this outerrel
+		 * will survive try_partial_nestloop_path.  The cheapest
+		 * unparameterized path is also in this list.
+		 */
+		foreach(lc2, innerrel->cheapest_parameterized_paths)
+		{
+			Path	   *innerpath = (Path *) lfirst(lc2);
+
+			/* Can't join to an inner path that is not parallel-safe */
+			if (!innerpath->parallel_safe)
+				continue;
+
+			/*
+			 * Like match_unsorted_outer, we only consider a single nestloop
+			 * path when the jointype is JOIN_UNIQUE_INNER.  But we have to scan
+			 * cheapest_parameterized_paths to find the one we want to consider,
+			 * because cheapest_total_path might not be parallel-safe.
+			 */
+			if (jointype == JOIN_UNIQUE_INNER)
+			{
+				if (!bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					continue;
+				innerpath = (Path *) create_unique_path(root, innerrel,
+											   innerpath, extra->sjinfo);
+			}
+
+			try_partial_nestloop_path(root, joinrel, outerpath, innerpath,
+									  pathkeys, jointype, extra);
+		}
+	}
 }
 
 /*
@@ -1350,6 +1552,55 @@ hash_inner_and_outer(PlannerInfo *root,
 				}
 			}
 		}
+
+		/*
+		 * If the joinrel is parallel-safe, we may be able to consider a
+		 * partial hash join.  However, we can't handle JOIN_UNIQUE_OUTER,
+		 * because the outer path will be partial, and therefore we won't be
+		 * able to properly guarantee uniqueness.  Also, the resulting path
+		 * must not be parameterized.
+		 */
+		if (joinrel->consider_parallel && jointype != JOIN_UNIQUE_OUTER &&
+			outerrel->partial_pathlist != NIL &&
+			bms_is_empty(extra->extra_lateral_rels))
+		{
+			Path   *cheapest_partial_outer;
+			Path   *cheapest_safe_inner = NULL;
+
+			cheapest_partial_outer =
+				(Path *) linitial(outerrel->partial_pathlist);
+
+			/*
+			 * Normally, given that the joinrel is parallel-safe, the cheapest
+			 * total inner path will also be parallel-safe, but if not, we'll
+			 * have to search cheapest_parameterized_paths for the cheapest
+			 * unparameterized inner path.
+			 */
+			if (cheapest_total_inner->parallel_safe)
+				cheapest_safe_inner = cheapest_total_inner;
+			else
+			{
+				ListCell   *lc;
+
+				foreach(lc, innerrel->cheapest_parameterized_paths)
+				{
+					Path	   *innerpath = (Path *) lfirst(lc);
+
+					if (innerpath->parallel_safe &&
+						bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					{
+						cheapest_safe_inner = innerpath;
+						break;
+					}
+				}
+			}
+
+			if (cheapest_safe_inner != NULL)
+				try_partial_hashjoin_path(root, joinrel,
+										  cheapest_partial_outer,
+										  cheapest_safe_inner,
+										  hashclauses, jointype, extra);
+		}
 	}
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index b2cc9f0..9b2b0b4 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1069,9 +1069,10 @@ mark_dummy_rel(RelOptInfo *rel)
 
 	/* Evict any previously chosen paths */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/* Set or update cheapest_total_path and related fields */
 	set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 411b36c..95d95f1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1125,7 +1125,7 @@ create_gather_plan(PlannerInfo *root, GatherPath *best_path)
 
 	gather_plan = make_gather(subplan->targetlist,
 							  NIL,
-							  best_path->num_workers,
+							  best_path->path.parallel_degree,
 							  best_path->single_copy,
 							  subplan);
 
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index d73e7c0..e9f0538 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -84,7 +84,8 @@ query_planner(PlannerInfo *root, List *tlist,
 
 		/* The only path for it is a trivial Result path */
 		add_path(final_rel, (Path *)
-				 create_result_path((List *) parse->jointree->quals));
+				 create_result_path(final_rel,
+									(List *) parse->jointree->quals));
 
 		/* Select cheapest path (pretty easy in this case...) */
 		set_cheapest(final_rel);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 09c3244..6d8483e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -217,7 +217,12 @@ compare_path_costs_fuzzily(Path *path1, Path *path2, double fuzz_factor)
  * The cheapest_parameterized_paths list collects all parameterized paths
  * that have survived the add_path() tournament for this relation.  (Since
  * add_path ignores pathkeys for a parameterized path, these will be paths
- * that have best cost or best row count for their parameterization.)
+ * that have best cost or best row count for their parameterization.  We
+ * may also have both a parallel-safe and a non-parallel-safe path for the
+ * same parameterization in some cases, but this should be relatively
+ * rare since, most typically, all paths for the same relation will be
+ * parallel-safe or none of them will.)
+ *
  * cheapest_parameterized_paths always includes the cheapest-total
  * unparameterized path, too, if there is one; the users of that list find
  * it more convenient if that's included.
@@ -352,11 +357,12 @@ set_cheapest(RelOptInfo *parent_rel)
  *	  A path is worthy if it has a better sort order (better pathkeys) or
  *	  cheaper cost (on either dimension), or generates fewer rows, than any
  *	  existing path that has the same or superset parameterization rels.
+ *	  We also consider parallel-safe paths more worthy than others.
  *
  *	  We also remove from the rel's pathlist any old paths that are dominated
  *	  by new_path --- that is, new_path is cheaper, at least as well ordered,
- *	  generates no more rows, and requires no outer rels not required by the
- *	  old path.
+ *	  generates no more rows, requires no outer rels not required by the old
+ *	  path, and is no less parallel-safe.
  *
  *	  In most cases, a path with a superset parameterization will generate
  *	  fewer rows (since it has more join clauses to apply), so that those two
@@ -431,7 +437,6 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 		PathCostComparison costcmp;
 		PathKeysComparison keyscmp;
 		BMS_Comparison outercmp;
-
 		p1_next = lnext(p1);
 
 		/*
@@ -470,14 +475,16 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						else if (keyscmp == PATHKEYS_BETTER2)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						else	/* keyscmp == PATHKEYS_EQUAL */
@@ -487,10 +494,10 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 								/*
 								 * Same pathkeys and outer rels, and fuzzily
 								 * the same cost, so keep just one; to decide
-								 * which, first check rows and then do a fuzzy
-								 * cost comparison with very small fuzz limit.
-								 * (We used to do an exact cost comparison,
-								 * but that results in annoying
+								 * which, first check parallel-safety, then rows,
+								 * then do a fuzzy cost comparison with very small
+								 * fuzz limit.  (We used to do an exact cost
+								 * comparison, but that results in annoying
 								 * platform-specific plan variations due to
 								 * roundoff in the cost estimates.)  If things
 								 * are still tied, arbitrarily keep only the
@@ -499,7 +506,13 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 								 * comparison decides the startup and total
 								 * costs compare differently.
 								 */
-								if (new_path->rows < old_path->rows)
+								if (new_path->parallel_safe >
+									old_path->parallel_safe)
+									remove_old = true;	/* new dominates old */
+								else if (new_path->parallel_safe <
+										 old_path->parallel_safe)
+									accept_new = false;	/* old dominates new */
+								else if (new_path->rows < old_path->rows)
 									remove_old = true;	/* new dominates old */
 								else if (new_path->rows > old_path->rows)
 									accept_new = false; /* old dominates new */
@@ -512,10 +525,12 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 														 * dominates new */
 							}
 							else if (outercmp == BMS_SUBSET1 &&
-									 new_path->rows <= old_path->rows)
+									 new_path->rows <= old_path->rows &&
+								 new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 							else if (outercmp == BMS_SUBSET2 &&
-									 new_path->rows >= old_path->rows)
+									 new_path->rows >= old_path->rows &&
+								 new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 							/* else different parameterizations, keep both */
 						}
@@ -527,7 +542,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						break;
@@ -538,7 +554,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						break;
@@ -685,6 +702,215 @@ add_path_precheck(RelOptInfo *parent_rel,
 	return true;
 }
 
+/*
+ * add_partial_path
+ *	  Like add_path, our goal here is to consider whether a path is worthy
+ *	  of being kept around, but the considerations here are a bit different.
+ *	  A partial path is one which can be executed in any number of workers in
+ *	  parallel such that each worker will generate a subset of the path's
+ *	  overall result.
+ *
+ *	  We don't generate parameterized partial paths for several reasons.  Most
+ *	  importantly, they're not safe to execute, because there's nothing to
+ *	  make sure that a parallel scan within the parameterized portion of the
+ *	  plan is running with the same value in every worker at the same time.
+ *	  Fortunately, it seems unlikely to be worthwhile anyway, because having
+ *	  each worker scan the entire outer relation and a subset of the inner
+ *	  relation will generally be a terrible plan.  The inner (parameterized)
+ *	  side of the plan will be small anyway.  There could be rare cases where
+ *	  this wins big - e.g. if join order constraints put a 1-row relation on
+ *	  the outer side of the topmost join with a parameterized plan on the inner
+ *	  side - but we'll have to be content not to handle such cases until somebody
+ *	  builds an executor infrastructure that can cope with them.
+ *
+ *	  Because we don't consider parameterized paths here, we also don't
+ *	  need to consider the row counts as a measure of quality: every path will
+ *	  produce the same number of rows.  Neither do we need to consider startup
+ *	  costs: parallelism is only used for plans that will be run to completion.
+ *	  Therefore, this routine is much simpler than add_path: it needs to
+ *	  consider only pathkeys and total cost.
+ */
+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	bool		accept_new = true;		/* unless we find a superior old path */
+	ListCell   *insert_after = NULL;	/* where to insert new item */
+	ListCell   *p1;
+	ListCell   *p1_prev;
+	ListCell   *p1_next;
+
+	/* Check for query cancel. */
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * As in add_path, throw out any paths which are dominated by the new path,
+	 * but throw out the new path if some existing path dominates it.
+	 */
+	p1_prev = NULL;
+	for (p1 = list_head(parent_rel->partial_pathlist); p1 != NULL;
+		 p1 = p1_next)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		bool		remove_old = false; /* unless new proves superior */
+		PathKeysComparison keyscmp;
+
+		p1_next = lnext(p1);
+
+		/* Compare pathkeys. */
+		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
+
+		/* Unless pathkeys are incomparable, keep just one of the two paths. */
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
+			{
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
+					accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost
+						* STD_FUZZ_FACTOR)
+			{
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
+					remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
+			}
+		}
+
+		/*
+		 * Remove current element from partial_pathlist if dominated by new.
+		 */
+		if (remove_old)
+		{
+			parent_rel->partial_pathlist =
+				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
+			/* add_path has a special case for IndexPath; we don't need it */
+			Assert(!IsA(old_path, IndexPath));
+			pfree(old_path);
+			/* p1_prev does not advance */
+		}
+		else
+		{
+			/* new belongs after this old path if it has cost >= old's */
+			if (new_path->total_cost >= old_path->total_cost)
+				insert_after = p1;
+			/* p1_prev advances */
+			p1_prev = p1;
+		}
+
+		/*
+		 * If we found an old path that dominates new_path, we can quit
+		 * scanning the partial_pathlist; we will not add new_path, and we
+		 * assume new_path cannot dominate any later path.
+		 */
+		if (!accept_new)
+			break;
+	}
+
+	if (accept_new)
+	{
+		/* Accept the new path: insert it at proper place */
+		if (insert_after)
+			lappend_cell(parent_rel->partial_pathlist, insert_after, new_path);
+		else
+			parent_rel->partial_pathlist =
+				lcons(new_path, parent_rel->partial_pathlist);
+	}
+	else
+	{
+		/* add_path has a special case for IndexPath; we don't need it */
+		Assert(!IsA(new_path, IndexPath));
+		/* Reject and recycle the new path */
+		pfree(new_path);
+	}
+}
+
+/*
+ * add_partial_path_precheck
+ *	  Check whether a proposed new partial path could possibly get accepted.
+ *
+ * Unlike add_path_precheck, we can ignore startup cost and parameterization,
+ * since they don't matter for partial paths (see add_partial_path).  But
+ * we do want to make sure we don't add a partial path if there's already
+ * a complete path that dominates it, since in that case the proposed path
+ * is surely a loser.
+ */
+bool
+add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost,
+						  List *pathkeys)
+{
+	ListCell   *p1;
+
+	/*
+	 * Our goal here is twofold.  First, we want to find out whether this
+	 * path is clearly inferior to some existing partial path.  If so, we want
+	 * to reject it immediately.  Second, we want to find out whether this
+	 * path is clearly superior to some existing partial path -- at least,
+	 * modulo final cost computations.  If so, we definitely want to consider
+	 * it.
+	 *
+	 * Unlike add_path(), we always compare pathkeys here.  This is because
+	 * we expect partial_pathlist to be very short, and getting a definitive
+	 * answer at this stage avoids the need to call add_path_precheck.
+	 */
+	foreach(p1, parent_rel->partial_pathlist)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		PathKeysComparison keyscmp;
+
+		keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER1)
+				return false;
+			if (old_path->total_cost > total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER2)
+				return true;
+		}
+	}
+
+	/*
+	 * This path is neither clearly inferior to an existing partial path
+	 * nor clearly good enough that it might replace one.  Compare it to
+	 * non-parallel plans.  If it loses even before accounting for the cost
+	 * of the Gather node, we should definitely reject it.
+	 *
+	 * Note that we pass the total_cost to add_path_precheck twice.  This is
+	 * because it's never advantageous to consider the startup cost of a
+	 * partial path; the resulting plans, if run in parallel, will be run to
+	 * completion.
+	 */
+	if (!add_path_precheck(parent_rel, total_cost, total_cost, pathkeys,
+						   NULL))
+		return false;
+
+	return true;
+}
+
 
 /*****************************************************************************
  *		PATH NODE CREATION ROUTINES
@@ -697,7 +923,7 @@ add_path_precheck(RelOptInfo *parent_rel,
  */
 Path *
 create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers)
+					Relids required_outer, int parallel_degree)
 {
 	Path	   *pathnode = makeNode(Path);
 
@@ -705,10 +931,12 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->parent = rel;
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
-	pathnode->parallel_aware = nworkers > 0 ? true : false;
+	pathnode->parallel_aware = parallel_degree > 0 ? true : false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = parallel_degree;
 	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
 
-	cost_seqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
 
 	return pathnode;
 }
@@ -727,6 +955,8 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* samplescan has unordered result */
 
 	cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -781,6 +1011,8 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
@@ -827,6 +1059,8 @@ create_bitmap_heap_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapqual = bitmapqual;
@@ -853,6 +1087,8 @@ create_bitmap_and_path(PlannerInfo *root,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -878,6 +1114,8 @@ create_bitmap_or_path(PlannerInfo *root,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -903,6 +1141,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->tidquals = tidquals;
@@ -921,7 +1161,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
  * Note that we must handle subpaths = NIL, representing a dummy access path.
  */
 AppendPath *
-create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
+create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer,
+				   int parallel_degree)
 {
 	AppendPath *pathnode = makeNode(AppendPath);
 	ListCell   *l;
@@ -931,6 +1172,8 @@ create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* result is always considered
 										 * unsorted */
 	pathnode->subpaths = subpaths;
@@ -985,6 +1228,8 @@ create_merge_append_path(PlannerInfo *root,
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->subpaths = subpaths;
 
@@ -1052,7 +1297,7 @@ create_merge_append_path(PlannerInfo *root,
  *	  This is only used for the case of a query with an empty jointree.
  */
 ResultPath *
-create_result_path(List *quals)
+create_result_path(RelOptInfo *rel, List *quals)
 {
 	ResultPath *pathnode = makeNode(ResultPath);
 
@@ -1060,6 +1305,8 @@ create_result_path(List *quals)
 	pathnode->path.parent = NULL;
 	pathnode->path.param_info = NULL;	/* there are no other rels... */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;
 	pathnode->quals = quals;
 
@@ -1094,6 +1341,8 @@ create_material_path(RelOptInfo *rel, Path *subpath)
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = subpath->pathkeys;
 
 	pathnode->subpath = subpath;
@@ -1155,6 +1404,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 
 	/*
 	 * Assume the output is unsorted, since we don't necessarily have pathkeys
@@ -1328,19 +1579,30 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
  */
 GatherPath *
 create_gather_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
-				   Relids required_outer, int nworkers)
+				   Relids required_outer)
 {
 	GatherPath *pathnode = makeNode(GatherPath);
 
+	Assert(subpath->parallel_safe);
+
 	pathnode->path.pathtype = T_Gather;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = false;
+	pathnode->path.parallel_degree = subpath->parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* Gather has unordered result */
 
 	pathnode->subpath = subpath;
-	pathnode->num_workers = nworkers;
+	pathnode->single_copy = false;
+
+	if (pathnode->path.parallel_degree == 0)
+	{
+		pathnode->path.parallel_degree = 1;
+		pathnode->path.pathkeys = subpath->pathkeys;
+		pathnode->single_copy = true;
+	}
 
 	cost_gather(pathnode, root, rel, pathnode->path.param_info);
 
@@ -1393,6 +1655,8 @@ create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_subqueryscan(pathnode, root, rel, pathnode->param_info);
@@ -1416,6 +1680,8 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1439,6 +1705,8 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -1461,6 +1729,8 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* XXX for now, result is always unordered */
 
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -1484,6 +1754,8 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	/* Cost is the same as for a regular CTE scan */
@@ -1516,6 +1788,8 @@ create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.rows = rows;
 	pathnode->path.startup_cost = startup_cost;
 	pathnode->path.total_cost = total_cost;
@@ -1651,6 +1925,9 @@ create_nestloop_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = joinrel->consider_parallel;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->path.parallel_degree = outer_path->parallel_degree;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->jointype = jointype;
 	pathnode->outerjoinpath = outer_path;
@@ -1709,6 +1986,8 @@ create_mergejoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel;
+	pathnode->jpath.path.parallel_degree = 0;
 	pathnode->jpath.path.pathkeys = pathkeys;
 	pathnode->jpath.jointype = jointype;
 	pathnode->jpath.outerjoinpath = outer_path;
@@ -1766,6 +2045,9 @@ create_hashjoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->jpath.path.parallel_degree = outer_path->parallel_degree;
 
 	/*
 	 * A hashjoin never has pathkeys, since its output ordering is
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 996b7fe..8d7ac48 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -107,6 +107,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	rel->reltargetlist = NIL;
 	rel->pathlist = NIL;
 	rel->ppilist = NIL;
+	rel->partial_pathlist = NIL;
 	rel->cheapest_startup_path = NULL;
 	rel->cheapest_total_path = NULL;
 	rel->cheapest_unique_path = NULL;
@@ -369,6 +370,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->reltargetlist = NIL;
 	joinrel->pathlist = NIL;
 	joinrel->ppilist = NIL;
+	joinrel->partial_pathlist = NIL;
 	joinrel->cheapest_startup_path = NULL;
 	joinrel->cheapest_total_path = NULL;
 	joinrel->cheapest_unique_path = NULL;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f28e56c..52d3c81 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -63,6 +63,12 @@ typedef struct Instrumentation
 	BufferUsage bufusage;		/* Total buffer usage */
 } Instrumentation;
 
+typedef struct WorkerInstrumentation
+{
+	int			num_workers;	/* # of structures that follow */
+	Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerInstrumentation;
+
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 
 extern Instrumentation *InstrAlloc(int n, int instrument_options);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..5ccf470 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1029,6 +1029,7 @@ typedef struct PlanState
 								 * top-level plan */
 
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
+	WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
 
 	/*
 	 * Common structural data for all Plan types.  These links to subsidiary
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9a0dd28..3ebb9dc 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -458,6 +458,7 @@ typedef struct RelOptInfo
 	List	   *reltargetlist;	/* Vars to be output by scan of relation */
 	List	   *pathlist;		/* Path structures */
 	List	   *ppilist;		/* ParamPathInfos used in pathlist */
+	List	   *partial_pathlist;	/* partial Paths */
 	struct Path *cheapest_startup_path;
 	struct Path *cheapest_total_path;
 	struct Path *cheapest_unique_path;
@@ -755,6 +756,8 @@ typedef struct Path
 	RelOptInfo *parent;			/* the relation this path can build */
 	ParamPathInfo *param_info;	/* parameterization info, or NULL if none */
 	bool		parallel_aware; /* engage parallel-aware logic? */
+	bool		parallel_safe;	/* OK to use as part of parallel plan? */
+	int			parallel_degree; /* desired parallel degree; 0 = not parallel */
 
 	/* estimated size/costs for path (see costsize.c for more info) */
 	double		rows;			/* estimated number of result tuples */
@@ -1057,7 +1060,6 @@ typedef struct GatherPath
 {
 	Path		path;
 	Path	   *subpath;		/* path for each worker */
-	int			num_workers;	/* number of workers sought to help */
 	bool		single_copy;	/* path must not be executed >1x */
 } GatherPath;
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac21a3a..25a7303 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -72,7 +72,7 @@ extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
-			 ParamPathInfo *param_info, int nworkers);
+			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index f28b4e2..93e0e4e 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -29,9 +29,12 @@ extern void add_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_path_precheck(RelOptInfo *parent_rel,
 				  Cost startup_cost, Cost total_cost,
 				  List *pathkeys, Relids required_outer);
+extern void add_partial_path(RelOptInfo *parent_rel, Path *new_path);
+extern bool add_partial_path_precheck(RelOptInfo *parent_rel,
+						  Cost total_cost, List *pathkeys);
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers);
+					Relids required_outer, int parallel_degree);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
@@ -59,19 +62,18 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths,
-				   Relids required_outer);
+				   Relids required_outer, int parallel_degree);
 extern MergeAppendPath *create_merge_append_path(PlannerInfo *root,
 						 RelOptInfo *rel,
 						 List *subpaths,
 						 List *pathkeys,
 						 Relids required_outer);
-extern ResultPath *create_result_path(List *quals);
+extern ResultPath *create_result_path(RelOptInfo *rel, List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 				   Path *subpath, SpecialJoinInfo *sjinfo);
 extern GatherPath *create_gather_path(PlannerInfo *root,
-				   RelOptInfo *rel, Path *subpath, Relids required_outer,
-				   int nworkers);
+				   RelOptInfo *rel, Path *subpath, Relids required_outer);
 extern Path *create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 						 List *pathkeys, Relids required_outer);
 extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..0007c28 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -50,6 +50,8 @@ extern RelOptInfo *make_one_rel(PlannerInfo *root, List *joinlist);
 extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
+extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#1)
Re: parallel joins, and better parallel explain

On 26 November 2015 at 03:41, Robert Haas <robertmhaas@gmail.com> wrote:

Attached find a patch that does (mostly) two things. First, it allows
the optimizer to generate plans where a Nested Loop or Hash Join
appears below a Gather node. This is a big improvement on what we
have today, where only a sequential scan can be parallelized; with
this patch, entire join problems can be parallelized, as long as they
don't need a Merge Join (see below for more on this).

Sounds like good progress.

This gives us multiple copies of the hash table, which means we must either
use N * work_mem, or we must limit the hash table to work_mem / N per
partial plan.

How can the partial paths share a hash table?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#2)
Re: parallel joins, and better parallel explain

On Thu, Nov 26, 2015 at 3:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Sounds like good progress.

Thanks.

This gives us multiple copies of the hash table, which means we must either
use N * work_mem, or we must limit the hash table to work_mem / N per
partial plan.

We use N * work_mem in this patch. The other option would be a lot
more expensive in terms of planning time, because we'd have to
generate one set of hash join paths (or at least hash join costs) for
use in serial plans and another set for use in parallel plans. As it
is, the parallel stuff just piggybacks on the plans we're generating
anyway. We might have to change that at some point, but I think we'll
do well to put that point off as long as possible.

How can the partial paths share a hash table?

I'm glad you asked that question. For that, we need an allocator that
can work with dynamic shared memory segments, and a hash table built
on top of that allocator. It's relatively complicated because the
same DSM segments can be mapped at different addresses in different
processes, so you can't use native pointers. However, I'm pleased to
report that my colleague Thomas Munro is working on this problem, and
that he will submit this work to pgsql-hackers when it's ready, which
may be soon enough that we can consider including this in 9.6 if the
design meets with general approval.
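
To make the pointer problem concrete, here is a minimal sketch (my own
illustration, not Thomas's actual code) of the usual workaround: shared
structures store offsets from the segment base rather than raw pointers,
and each process translates through its own mapping address:

#include <stddef.h>

/*
 * Sketch only.  Offset 0 is reserved to mean NULL, which works because
 * nothing useful lives at the very start of the segment.
 */
typedef size_t relptr;

static inline relptr
relptr_store(char *segment_base, void *ptr)
{
	return ptr == NULL ? 0 : (size_t) ((char *) ptr - segment_base);
}

static inline void *
relptr_access(char *segment_base, relptr off)
{
	return off == 0 ? NULL : (void *) (segment_base + off);
}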

As you may recall, I previously proposed an allocator framework,
somewhat of a WIP in progress at that time, and the reaction here was
a bit lukewarm, which is why I shifted from parallel sort to parallel
seq scan as a first project. I now think that was a good decision,
and as a result of Peter Geoghegan's work on sorting and my own
experience further developing the parallelism code, I no longer think
that allocator is the right solution for parallel sort anyway. But I
still think it might be the right solution for a parallel hash join
with a shared hash table.

My idea is that you'd end up with a plan like this:

Gather
  -> Hash Join
       -> Parallel Seq Scan
       -> Parallel Hash
            -> Parallel Seq Scan

Not only does this build only one copy of the hash table instead of N
copies, but we can parallelize the hash table construction itself by
having all workers insert in parallel, which is pretty cool. However,
I don't expect this to be better than an unshared hash table in all
cases. We have a fair amount of evidence that accessing
backend-private data structures can sometimes be much faster than
accessing shared data structures - cf. not only the syscaches and
relcache, but also the use of Materialize nodes by the planner in
certain cases even when there are no filtering quals underneath. So
there's probably going to be a need to consider both types of plans
and decide between them based on memory usage and expected runtime.
The details are not all clear to me yet, and most likely we'll have to
postpone some of the decisions until the code is written and can be
performance-tested to see in which cases one strategy performs better
or worse than the other. What I can confirm at this point is that
I've thought about the problem you're asking about here, and that
EnterpriseDB intends to contribute code to address it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4Greg Stark
stark@mit.edu
In reply to: Robert Haas (#3)
Re: parallel joins, and better parallel explain

On Mon, Nov 30, 2015 at 4:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Not only does this build only one copy of the hash table instead of N
copies, but we can parallelize the hash table construction itself by
having all workers insert in parallel, which is pretty cool.

Hm. The case where you don't want parallel building of the hash table
might be substantially simpler. You could build a hash table and then
copy it into shared memory as a single contiguous read-only data
structure optimized for lookups. I have an inkling that there are even
ways of marking the memory as being read-only and not needing cache
synchronization.
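
As a rough sketch of what I mean (assuming the finished table has already
been copied into a page-aligned shared mapping; the cache part is the bit
I'm only guessing at):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/*
 * Sketch only: revoke write access to the finished hash table.  addr must
 * be page-aligned and len a multiple of the page size.
 */
static void
freeze_shared_table(void *addr, size_t len)
{
	if (mprotect(addr, len, PROT_READ) != 0)
	{
		perror("mprotect");
		exit(EXIT_FAILURE);
	}
}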

--
greg


#5Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#4)
Re: parallel joins, and better parallel explain

On Mon, Nov 30, 2015 at 12:01 PM, Greg Stark <stark@mit.edu> wrote:

On Mon, Nov 30, 2015 at 4:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Not only does this build only one copy of the hash table instead of N
copies, but we can parallelize the hash table construction itself by
having all workers insert in parallel, which is pretty cool.

Hm. The case where you don't want parallel building of the hash table
might be substantially simpler. You could build a hash table and then
copy it into shared memory as a single contiguous read-only data
structure optimized for lookups. I have an inkling that there are even
ways of marking the memory as being read-only and not needing cache
synchronization.

Yes, that's another approach that we could consider. I suspect it's
not really a lot better than the parallel-build case. If the inner
table is small, then it's probably best to have every worker build its
own unshared copy of the table rather than having one worker build the
table and everybody else wait, which might lead to stalls during the
build phase and additional traffic on the memory bus during the probe
phase (though, as you say, giving the kernel a hint could help in some
cases). If the inner table is big, then having everybody wait for a
single process to perform the build probably sucks.

But it's not impossible that there could be cases when it trumps every
other strategy. For example, if you're going to be doing a huge
number of probes, you could try building the hash table with several
different hash functions until you find one that is collision-free or
nearly so, and then use that one. The extra effort spent during the
build phase might speed up the probe phase enough to win. You can't
do that sort of thing so easily in a parallel build. Even apart from
that, if you build the hash table locally first and then copy it into
shared memory afterwards, you can free up any extra memory and use
only the minimal amount that you really need, which could be
beneficial in some cases. I'm just not sure that's appealing enough
to justify carrying a third system for building hash tables for hash
joins.
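
To illustrate the several-hash-functions idea, a toy sketch (not proposed
code; the multiply-and-mod mixer just stands in for a real seeded hash):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns true if hashing every key with this seed leaves no bucket with
 * more than one entry; the caller retries with successive seeds until one
 * works, then uses that seed for all probes.
 */
static bool
seed_is_collision_free(const uint32_t *keys, int nkeys,
					   uint32_t seed, uint32_t nbuckets, uint8_t *used)
{
	int			i;

	memset(used, 0, nbuckets);
	for (i = 0; i < nkeys; i++)
	{
		uint32_t	h = ((keys[i] ^ seed) * 2654435761u) % nbuckets;

		if (used[h])
			return false;		/* collision: reject this seed */
		used[h] = 1;
	}
	return true;
}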

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#1)
Re: parallel joins, and better parallel explain

On Thu, Nov 26, 2015 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Attached find a patch that does (mostly) two things.

I have started looking into this and would like to share a few findings
with you:

-
+	/*
+	 * Primitive parallel cost model.  Assume the leader will do half as much
+	 * work as a regular worker, because it will also need to read the tuples
+	 * returned by the workers when they percolate up to the gather node.
+	 * This is almost certainly not exactly the right way to model this, so
+	 * this will probably need to be changed at some point...
+	 */
+	if (path->parallel_degree > 0)
+		path->rows = path->rows / (path->parallel_degree + 0.5);
+
 	if (!enable_seqscan)
 		startup_cost += disable_cost;

@@ -225,16 +234,6 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 	run_cost += cpu_per_tuple * baserel->tuples;

-	/*
-	 * Primitive parallel cost model.  Assume the leader will do half as much
-	 * work as a regular worker, because it will also need to read the tuples
-	 * returned by the workers when they percolate up to the gather node.
-	 * This is almost certainly not exactly the right way to model this, so
-	 * this will probably need to be changed at some point...
-	 */
-	if (nworkers > 0)
-		run_cost = run_cost / (nworkers + 0.5);
-
..

The above, together with the changes in add_path(), makes the planner *not*
select a parallel path for a seq scan where earlier it was possible. I think
you want to change the costing of parallel plans based on rows selected
instead of total_cost, but there seems to be some problem in the logic (I
think the Gather node is not taking the reduced cost into account). Consider
the case below:

create table tbl_parallel_test(c1 int, c2 char(1000));
insert into tbl_parallel_test values(generate_series(1,1000000),'aaaaa');
Analyze tbl_parallel_test;

set max_parallel_degree=6;
Explain select count(*) from tbl_parallel_test where c1 < 10000;

Without the patch, it is able to use a parallel plan for the above case;
with the patch, it is not.

- There seems to be some inconsistency in Explain's output when
multiple workers are used.

Case 1:
Consider that the table is populated as mentioned above.
Change max_worker_processes=2 in postgresql.conf.
set max_parallel_degree=4;

Explain (Analyze,Verbose) select count(*) from tbl_parallel_test where c1 < 10000;

                                QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=46227.78..46227.79 rows=1 width=0) (actual time=182583.554..182583.555 rows=1 loops=1)
   Output: count(*)
   ->  Gather  (cost=1000.00..46203.83 rows=9579 width=0) (actual time=167930.039..182571.654 rows=9999 loops=1)
         Output: c1, c2
         Number of Workers: 4
         ->  Parallel Seq Scan on public.tbl_parallel_test  (cost=0.00..44245.93 rows=2129 width=0) (actual time=167904.516..182498.494 rows=3333 loops=3)
               Output: c1, c2
               Filter: (tbl_parallel_test.c1 < 10000)
               Rows Removed by Filter: 330000
               Worker 0: actual time=167890.584..182491.043 rows=4564 loops=1
               Worker 1: actual time=167893.651..182461.904 rows=2740 loops=1
 Planning time: 0.121 ms
 Execution time: 182588.419 ms
(13 rows)

1. Rows Removed by Filter should be 990001.
2. The total of rows=9999 at the Gather node is right, but it is not
obvious how the rows from each worker and the leader lead to that total.

Case 2:

Change max_worker_processes=8 in postgresql.conf.
set max_parallel_degree=4;

postgres=# Explain (Analyze,Verbose) select count(*) from tbl_parallel_test where c1 < 10000;
                                QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=46227.78..46227.79 rows=1 width=0) (actual time=39365.233..39365.234 rows=1 loops=1)
   Output: count(*)
   ->  Gather  (cost=1000.00..46203.83 rows=9579 width=0) (actual time=47.485..39344.574 rows=9999 loops=1)
         Output: c1, c2
         Number of Workers: 4
         ->  Parallel Seq Scan on public.tbl_parallel_test  (cost=0.00..44245.93 rows=2129 width=0) (actual time=11.910..39262.255 rows=2000 loops=5)
               Output: c1, c2
               Filter: (tbl_parallel_test.c1 < 10000)
               Rows Removed by Filter: 198000
               Worker 0: actual time=5.931..39249.068 rows=3143 loops=1
               Worker 1: actual time=5.778..39254.504 rows=1687 loops=1
               Worker 2: actual time=0.836..39264.683 rows=2170 loops=1
               Worker 3: actual time=1.101..39251.459 rows=1715 loops=1
 Planning time: 0.123 ms
 Execution time: 39383.296 ms
(15 rows)

The problems reported in the previous case are visible in this case as
well. I think both are due to the same problem.

Case 3:
postgres=# Explain (Analyze,Verbose,Buffers) select count(*) from tbl_parallel_test where c1 < 10000;
                                QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=46227.78..46227.79 rows=1 width=0) (actual time=33607.146..33607.147 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=548 read=142506
   ->  Gather  (cost=1000.00..46203.83 rows=9579 width=0) (actual time=33.983..33592.030 rows=9999 loops=1)
         Output: c1, c2
         Number of Workers: 4
         Buffers: shared hit=548 read=142506
         ->  Parallel Seq Scan on public.tbl_parallel_test  (cost=0.00..44245.93 rows=2129 width=0) (actual time=13.447..33354.099 rows=2000 loops=5)
               Output: c1, c2
               Filter: (tbl_parallel_test.c1 < 10000)
               Rows Removed by Filter: 198000
               Buffers: shared hit=352 read=142506
               Worker 0: actual time=18.422..33322.132 rows=2170 loops=1
                 Buffers: shared hit=4 read=30765
               Worker 1: actual time=0.803..33283.979 rows=1890 loops=1
                 Buffers: shared hit=1 read=26679
               Worker 2: actual time=0.711..33360.007 rows=1946 loops=1
                 Buffers: shared hit=197 read=30899
               Worker 3: actual time=15.057..33252.605 rows=2145 loops=1
                 Buffers: shared hit=145 read=25433
 Planning time: 0.217 ms
 Execution time: 33612.964 ms
(22 rows)

I am not able to understand how the buffer usage adds up to what is
shown at the Gather node.

- I think it would be better if we added some explanation to the EXPLAIN
VERBOSE section of the documentation, along with an example on the same
page. This can help users understand this feature.

It would be better if we could split this patch into multiple patches:
EXPLAIN-related changes, Append pushdown-related changes, and join
pushdown-related changes. You can choose to push the patches as you prefer,
but splitting can certainly help in review/verification of the code.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#6)
Re: parallel joins, and better parallel explain

On Tue, Dec 1, 2015 at 5:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 26, 2015 at 8:11 AM, Robert Haas <robertmhaas@gmail.com>

wrote:

Attached find a patch that does (mostly) two things.

I have started looking into this and would like to share few findings
with you:

- There seems to be some inconsistency in Explain's output when
multiple workers are used.

Forgot to mention that I have changed the code of cost_seqscan() by adding
the line below, so that it can select a parallel plan. This is just a
temporary change to verify EXPLAIN's output.

cost_seqscan()
{
..

run_cost = run_cost / (path->parallel_degree + 0.5);
..
}

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#8Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#6)
2 attachment(s)
Re: parallel joins, and better parallel explain

On Tue, Dec 1, 2015 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The above, together with the changes in add_path(), makes the planner not
select a parallel path for a seq scan where earlier it was possible. I think
you want to change the costing of parallel plans based on rows selected
instead of total_cost, but there seems to be some problem in the logic (I
think the Gather node is not taking the reduced cost into account).

Oops. The new version I've attached should fix this. I needed to make
a change there because previously the number of rows estimated for the
Parallel Seq Scan was the total number of rows,
not the number of rows per worker. That doesn't really matter when
we're only doing Parallel Seq Scan, but if you push a join below the
Gather, then the cost of the join won't be computed correctly unless
the row count is the number of rows per worker.
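
(To spell out the arithmetic: with 4 workers, the quoted change divides the
row estimate by parallel_degree + 0.5 = 4.5 to credit the leader with about
half a worker's share, so the 9579-row Gather estimate becomes 9579 / 4.5,
about 2129, which matches the rows=2129 shown by the Parallel Seq Scan
nodes in your plans.)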

- There seems to be some inconsistency in Explain's output when
multiple workers are used.

What is going on here is a bit confusing, but in fact I believe it to
be more correct than what we get with unpatched master. The problem
has to do with the way that the instrumentation counts loops, and the
way that EXPLAIN displays that information. In unpatched master,
InstrEndLoop() is not called before the worker instrumentation data is
aggregated to the leader. Therefore, each node under the Gather ends
up with a loop count of 1. Unless, of course, it was executed
multiple times in one of the workers, for example because it was on
the inside of a nested loop. In that case, it ends up with a loop
count equal to the number of times it was executed *minus the number
of workers*. Therefore, if there are 4 workers and a leader, and
between those 5 processes they executed the inner side of a nested
loop 1000 times, the final loop count is 996.

With the patch, the loop count is equal to the number of times that
the nodes were actually executed. Typically, this ends up being equal
to one more than the number of workers, because the leader executes it
and so do all the workers, but it can end up being less if not all
workers execute a particular node. Of course, it can also be more.
If the node is executed repeatedly, the final loop count is equal to
the total number of times that the node was executed across the leader
and all workers. So, in the above example, the inner side of a nested
loop would be 1000, not 996, which has the noteworthy advantage of
being correct.

What makes the output a tad confusing is that some but not all fields
in EXPLAIN output are shown as per-loop values. The startup cost,
total cost, and row counts are divided by the number of iterations. I
have always thought this was a terrible idea: when EXPLAIN tells me
about a nested loop with an inner index scan, I want to know the TOTAL
time spent on that index scan and the TOTAL number of rows returned,
but what I get is the result of dividing those values by the number of
loops and rounded off to a number of decimal places that almost
entirely eliminate the possibility of extracting useful information
from the results. However, I expect to be told that other people
(especially Tom Lane) don't want to change this, and in any case if we
were going to change it I think that would properly be a separate
patch.

So the net result of this is that the times and row counts are
*averages* across all of the loop iterations. In the case of the
inner side of a nested loop, this means you can read the data just as
you would in a non-parallel plan. For nodes executed exactly once per
worker plus once in the master, the value displayed ends up being a
per-process average of the amount of time spent, and a per-process
average of the number of rows. On the other hand, values for buffers
are NOT divided by the loop count, so those values are absolute
totals. Once you understand this, I think the data is pretty easy to
read.
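
(As a worked example using your second plan above: the Parallel Seq Scan
shows rows=2000 loops=5, i.e. an average of 2000 rows per process across
the leader and four workers, and 5 * 2000 = 10000, which, allowing for
rounding, matches the 9999 rows the Gather node reports.)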

->  Gather  (cost=1000.00..46203.83 rows=9579 width=0) (actual time=33.983..33592.030 rows=9999 loops=1)
      Output: c1, c2
      Number of Workers: 4
      Buffers: shared hit=548 read=142506
      ->  Parallel Seq Scan on public.tbl_parallel_test  (cost=0.00..44245.93 rows=2129 width=0) (actual time=13.447..33354.099 rows=2000 loops=5)
            Output: c1, c2
            Filter: (tbl_parallel_test.c1 < 10000)
            Rows Removed by Filter: 198000
            Buffers: shared hit=352 read=142506
            Worker 0: actual time=18.422..33322.132 rows=2170 loops=1
              Buffers: shared hit=4 read=30765
            Worker 1: actual time=0.803..33283.979 rows=1890 loops=1
              Buffers: shared hit=1 read=26679
            Worker 2: actual time=0.711..33360.007 rows=1946 loops=1
              Buffers: shared hit=197 read=30899
            Worker 3: actual time=15.057..33252.605 rows=2145 loops=1
              Buffers: shared hit=145 read=25433
Planning time: 0.217 ms
Execution time: 33612.964 ms
(22 rows)

I am not able to understand how the buffer usage adds up to what is
shown at the Gather node.

It doesn't, of course. But I'm not sure it should, and I don't think
this patch has changed anything about that either way. The patch only
affects the nodes that run in the workers, and Gather doesn't.

- I think it would be better if we added some explanation to the EXPLAIN
VERBOSE section of the documentation, along with an example on the same
page. This can help users understand this feature.

It would be better if we could split this patch into multiple patches:
EXPLAIN-related changes, Append pushdown-related changes, and join
pushdown-related changes. You can choose to push the patches as you prefer,
but splitting can certainly help in review/verification of the code.

I don't think it really makes sense to split the append push-down
changes from the join push-down changes; those share a great deal of
code. But I've now split out the EXPLAIN changes. See attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-explain-v2.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 183d3d9..12dae77 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -103,6 +103,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
 						   PlanState *planstate, ExplainState *es);
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir,
 						ExplainState *es);
 static void ExplainScanTarget(Scan *plan, ExplainState *es);
@@ -1437,108 +1438,73 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	/* Show buffer usage */
 	if (es->buffers && planstate->instrument)
+		show_buffer_usage(es, &planstate->instrument->bufusage);
+
+	/* Show worker detail */
+	if (es->analyze && es->verbose && planstate->worker_instrument)
 	{
-		const BufferUsage *usage = &planstate->instrument->bufusage;
+		WorkerInstrumentation *w = planstate->worker_instrument;
+		bool		opened_group = false;
+		int			n;
 
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+		for (n = 0; n < w->num_workers; ++n)
 		{
-			bool		has_shared = (usage->shared_blks_hit > 0 ||
-									  usage->shared_blks_read > 0 ||
-									  usage->shared_blks_dirtied > 0 ||
-									  usage->shared_blks_written > 0);
-			bool		has_local = (usage->local_blks_hit > 0 ||
-									 usage->local_blks_read > 0 ||
-									 usage->local_blks_dirtied > 0 ||
-									 usage->local_blks_written > 0);
-			bool		has_temp = (usage->temp_blks_read > 0 ||
-									usage->temp_blks_written > 0);
-			bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
-								 !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+			Instrumentation *instrument = &w->instrument[n];
+			double		nloops = instrument->nloops;
+			double		startup_sec;
+			double		total_sec;
+			double		rows;
+
+			if (nloops <= 0)
+				continue;
+			startup_sec = 1000.0 * instrument->startup / nloops;
+			total_sec = 1000.0 * instrument->total / nloops;
+			rows = instrument->ntuples / nloops;
 
-			/* Show only positive counter values. */
-			if (has_shared || has_local || has_temp)
+			if (es->format == EXPLAIN_FORMAT_TEXT)
 			{
 				appendStringInfoSpaces(es->str, es->indent * 2);
-				appendStringInfoString(es->str, "Buffers:");
-
-				if (has_shared)
-				{
-					appendStringInfoString(es->str, " shared");
-					if (usage->shared_blks_hit > 0)
-						appendStringInfo(es->str, " hit=%ld",
-										 usage->shared_blks_hit);
-					if (usage->shared_blks_read > 0)
-						appendStringInfo(es->str, " read=%ld",
-										 usage->shared_blks_read);
-					if (usage->shared_blks_dirtied > 0)
-						appendStringInfo(es->str, " dirtied=%ld",
-										 usage->shared_blks_dirtied);
-					if (usage->shared_blks_written > 0)
-						appendStringInfo(es->str, " written=%ld",
-										 usage->shared_blks_written);
-					if (has_local || has_temp)
-						appendStringInfoChar(es->str, ',');
-				}
-				if (has_local)
+				appendStringInfo(es->str, "Worker %d: ", n);
+				if (es->timing)
+					appendStringInfo(es->str,
+							"actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
+								 startup_sec, total_sec, rows, nloops);
+				else
+					appendStringInfo(es->str,
+									 "actual rows=%.0f loops=%.0f\n",
+									 rows, nloops);
+				es->indent++;
+				if (es->buffers)
+					show_buffer_usage(es, &instrument->bufusage);
+				es->indent--;
+			}
+			else
+			{
+				if (!opened_group)
 				{
-					appendStringInfoString(es->str, " local");
-					if (usage->local_blks_hit > 0)
-						appendStringInfo(es->str, " hit=%ld",
-										 usage->local_blks_hit);
-					if (usage->local_blks_read > 0)
-						appendStringInfo(es->str, " read=%ld",
-										 usage->local_blks_read);
-					if (usage->local_blks_dirtied > 0)
-						appendStringInfo(es->str, " dirtied=%ld",
-										 usage->local_blks_dirtied);
-					if (usage->local_blks_written > 0)
-						appendStringInfo(es->str, " written=%ld",
-										 usage->local_blks_written);
-					if (has_temp)
-						appendStringInfoChar(es->str, ',');
+					ExplainOpenGroup("Workers", "Workers", false, es);
+					opened_group = true;
 				}
-				if (has_temp)
+				ExplainOpenGroup("Worker", NULL, true, es);
+				ExplainPropertyInteger("Worker Number", n, es);
+
+				if (es->timing)
 				{
-					appendStringInfoString(es->str, " temp");
-					if (usage->temp_blks_read > 0)
-						appendStringInfo(es->str, " read=%ld",
-										 usage->temp_blks_read);
-					if (usage->temp_blks_written > 0)
-						appendStringInfo(es->str, " written=%ld",
-										 usage->temp_blks_written);
+					ExplainPropertyFloat("Actual Startup Time", startup_sec, 3, es);
+					ExplainPropertyFloat("Actual Total Time", total_sec, 3, es);
 				}
-				appendStringInfoChar(es->str, '\n');
-			}
+				ExplainPropertyFloat("Actual Rows", rows, 0, es);
+				ExplainPropertyFloat("Actual Loops", nloops, 0, es);
 
-			/* As above, show only positive counter values. */
-			if (has_timing)
-			{
-				appendStringInfoSpaces(es->str, es->indent * 2);
-				appendStringInfoString(es->str, "I/O Timings:");
-				if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
-					appendStringInfo(es->str, " read=%0.3f",
-							  INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
-				if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
-					appendStringInfo(es->str, " write=%0.3f",
-							 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
-				appendStringInfoChar(es->str, '\n');
+				if (es->buffers)
+					show_buffer_usage(es, &instrument->bufusage);
+
+				ExplainCloseGroup("Worker", NULL, true, es);
 			}
 		}
-		else
-		{
-			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
-			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
-			ExplainPropertyLong("Shared Dirtied Blocks", usage->shared_blks_dirtied, es);
-			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
-			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
-			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
-			ExplainPropertyLong("Local Dirtied Blocks", usage->local_blks_dirtied, es);
-			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
-			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
-			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
-			ExplainPropertyFloat("I/O Read Time", INSTR_TIME_GET_MILLISEC(usage->blk_read_time), 3, es);
-			ExplainPropertyFloat("I/O Write Time", INSTR_TIME_GET_MILLISEC(usage->blk_write_time), 3, es);
-		}
+
+		if (opened_group)
+			ExplainCloseGroup("Workers", "Workers", false, es);
 	}
 
 	/* Get ready to display the child plans */
@@ -2277,6 +2243,113 @@ explain_get_index_name(Oid indexId)
 }
 
 /*
+ * Show buffer usage details.
+ */
+static void
+show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+{
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		bool		has_shared = (usage->shared_blks_hit > 0 ||
+								  usage->shared_blks_read > 0 ||
+								  usage->shared_blks_dirtied > 0 ||
+								  usage->shared_blks_written > 0);
+		bool		has_local = (usage->local_blks_hit > 0 ||
+								 usage->local_blks_read > 0 ||
+								 usage->local_blks_dirtied > 0 ||
+								 usage->local_blks_written > 0);
+		bool		has_temp = (usage->temp_blks_read > 0 ||
+								usage->temp_blks_written > 0);
+		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
+								 !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+
+		/* Show only positive counter values. */
+		if (has_shared || has_local || has_temp)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfoString(es->str, "Buffers:");
+
+			if (has_shared)
+			{
+				appendStringInfoString(es->str, " shared");
+				if (usage->shared_blks_hit > 0)
+					appendStringInfo(es->str, " hit=%ld",
+									 usage->shared_blks_hit);
+				if (usage->shared_blks_read > 0)
+					appendStringInfo(es->str, " read=%ld",
+									 usage->shared_blks_read);
+				if (usage->shared_blks_dirtied > 0)
+					appendStringInfo(es->str, " dirtied=%ld",
+									 usage->shared_blks_dirtied);
+				if (usage->shared_blks_written > 0)
+					appendStringInfo(es->str, " written=%ld",
+									 usage->shared_blks_written);
+				if (has_local || has_temp)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_local)
+			{
+				appendStringInfoString(es->str, " local");
+				if (usage->local_blks_hit > 0)
+					appendStringInfo(es->str, " hit=%ld",
+									 usage->local_blks_hit);
+				if (usage->local_blks_read > 0)
+					appendStringInfo(es->str, " read=%ld",
+									 usage->local_blks_read);
+				if (usage->local_blks_dirtied > 0)
+					appendStringInfo(es->str, " dirtied=%ld",
+									 usage->local_blks_dirtied);
+				if (usage->local_blks_written > 0)
+					appendStringInfo(es->str, " written=%ld",
+									 usage->local_blks_written);
+				if (has_temp)
+					appendStringInfoChar(es->str, ',');
+			}
+			if (has_temp)
+			{
+				appendStringInfoString(es->str, " temp");
+				if (usage->temp_blks_read > 0)
+					appendStringInfo(es->str, " read=%ld",
+									 usage->temp_blks_read);
+				if (usage->temp_blks_written > 0)
+					appendStringInfo(es->str, " written=%ld",
+									 usage->temp_blks_written);
+			}
+			appendStringInfoChar(es->str, '\n');
+		}
+
+		/* As above, show only positive counter values. */
+		if (has_timing)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfoString(es->str, "I/O Timings:");
+			if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
+				appendStringInfo(es->str, " read=%0.3f",
+						  INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
+			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
+				appendStringInfo(es->str, " write=%0.3f",
+						 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+			appendStringInfoChar(es->str, '\n');
+		}
+	}
+	else
+	{
+		ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+		ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+		ExplainPropertyLong("Shared Dirtied Blocks", usage->shared_blks_dirtied, es);
+		ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+		ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+		ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+		ExplainPropertyLong("Local Dirtied Blocks", usage->local_blks_dirtied, es);
+		ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+		ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+		ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+		ExplainPropertyFloat("I/O Read Time", INSTR_TIME_GET_MILLISEC(usage->blk_read_time), 3, es);
+		ExplainPropertyFloat("I/O Write Time", INSTR_TIME_GET_MILLISEC(usage->blk_write_time), 3, es);
+	}
+}
+
+/*
  * Add some additional details about an IndexScan or IndexOnlyScan
  */
 static void
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 6730037..30e6b3d 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -49,20 +49,18 @@
 #define PARALLEL_TUPLE_QUEUE_SIZE		65536
 
 /* DSM structure for accumulating per-PlanState instrumentation. */
-typedef struct SharedPlanStateInstrumentation
-{
-	int plan_node_id;
-	slock_t mutex;
-	Instrumentation	instr;
-} SharedPlanStateInstrumentation;
-
-/* DSM structure for accumulating per-PlanState instrumentation. */
 struct SharedExecutorInstrumentation
 {
 	int instrument_options;
-	int ps_ninstrument;			/* # of ps_instrument structures following */
-	SharedPlanStateInstrumentation ps_instrument[FLEXIBLE_ARRAY_MEMBER];
+	int instrument_offset;		/* offset of first Instrumentation struct */
+	int num_workers;							/* # of workers */
+	int num_plan_nodes;							/* # of plan nodes */
+	int plan_node_id[FLEXIBLE_ARRAY_MEMBER];	/* array of plan node IDs */
+	/* array of num_plan_nodes * num_workers Instrumentation objects follows */
 };
+#define GetInstrumentationArray(sei) \
+	(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
+	 (Instrumentation *) (((char *) sei) + sei->instrument_offset))
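+/*
+ * Thus, the Instrumentation for plan node i, worker w, lives at
+ * GetInstrumentationArray(sei)[i * sei->num_workers + w]; this is how the
+ * retrieval and reporting routines below index into the array.
+ */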
 
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelEstimateContext
@@ -196,18 +194,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	if (planstate == NULL)
 		return false;
 
-	/* If instrumentation is enabled, initialize array slot for this node. */
+	/* If instrumentation is enabled, initialize slot for this node. */
 	if (d->instrumentation != NULL)
-	{
-		SharedPlanStateInstrumentation *instrumentation;
-
-		instrumentation = &d->instrumentation->ps_instrument[d->nnodes];
-		Assert(d->nnodes < d->instrumentation->ps_ninstrument);
-		instrumentation->plan_node_id = planstate->plan->plan_node_id;
-		SpinLockInit(&instrumentation->mutex);
-		InstrInit(&instrumentation->instr,
-				  d->instrumentation->instrument_options);
-	}
+		d->instrumentation->plan_node_id[d->nnodes] =
+			planstate->plan->plan_node_id;
 
 	/* Count this node. */
 	d->nnodes++;
@@ -307,6 +297,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	int			pstmt_len;
 	int			param_len;
 	int			instrumentation_len = 0;
+	int			instrument_offset = 0;
 
 	/* Allocate object for return value. */
 	pei = palloc0(sizeof(ParallelExecutorInfo));
@@ -364,8 +355,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	if (estate->es_instrument)
 	{
 		instrumentation_len =
-			offsetof(SharedExecutorInstrumentation, ps_instrument)
-			+ sizeof(SharedPlanStateInstrumentation) * e.nnodes;
+			offsetof(SharedExecutorInstrumentation, plan_node_id)
+			+ sizeof(int) * e.nnodes;
+		instrumentation_len = MAXALIGN(instrumentation_len);
+		instrument_offset = instrumentation_len;
+		instrumentation_len += sizeof(Instrumentation) * e.nnodes * nworkers;
 		shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 	}
@@ -407,9 +401,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	 */
 	if (estate->es_instrument)
 	{
+		Instrumentation *instrument;
+		int		i;
+
 		instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
 		instrumentation->instrument_options = estate->es_instrument;
-		instrumentation->ps_ninstrument = e.nnodes;
+		instrumentation->instrument_offset = instrument_offset;
+		instrumentation->num_workers = nworkers;
+		instrumentation->num_plan_nodes = e.nnodes;
+		instrument = GetInstrumentationArray(instrumentation);
+		for (i = 0; i < nworkers * e.nnodes; ++i)
+			InstrInit(&instrument[i], estate->es_instrument);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
 		pei->instrumentation = instrumentation;
@@ -444,20 +446,31 @@ static bool
 ExecParallelRetrieveInstrumentation(PlanState *planstate,
 						  SharedExecutorInstrumentation *instrumentation)
 {
+	Instrumentation *instrument;
 	int		i;
+	int		n;
+	int		ibytes;
 	int		plan_node_id = planstate->plan->plan_node_id;
-	SharedPlanStateInstrumentation *ps_instrument;
 
 	/* Find the instrumentation for this node. */
-	for (i = 0; i < instrumentation->ps_ninstrument; ++i)
-		if (instrumentation->ps_instrument[i].plan_node_id == plan_node_id)
+	for (i = 0; i < instrumentation->num_plan_nodes; ++i)
+		if (instrumentation->plan_node_id[i] == plan_node_id)
 			break;
-	if (i >= instrumentation->ps_ninstrument)
+	if (i >= instrumentation->num_plan_nodes)
 		elog(ERROR, "plan node %d not found", plan_node_id);
 
-	/* No need to acquire the spinlock here; workers have exited already. */
-	ps_instrument = &instrumentation->ps_instrument[i];
-	InstrAggNode(planstate->instrument, &ps_instrument->instr);
+	/* Accumulate the statistics from all workers. */
+	instrument = GetInstrumentationArray(instrumentation);
+	instrument += i * instrumentation->num_workers;
+	for (n = 0; n < instrumentation->num_workers; ++n)
+		InstrAggNode(planstate->instrument, &instrument[n]);
+
+	/* Also store the per-worker detail. */
+	ibytes = instrumentation->num_workers * sizeof(Instrumentation);
+	planstate->worker_instrument =
+		palloc(offsetof(WorkerInstrumentation, instrument) + ibytes);
+	planstate->worker_instrument->num_workers = instrumentation->num_workers;
+	memcpy(&planstate->worker_instrument->instrument, instrument, ibytes);
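+	/* explain.c prints this per-worker detail under EXPLAIN (ANALYZE, VERBOSE). */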
 
 	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
 								 instrumentation);
@@ -568,7 +581,9 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 {
 	int		i;
 	int		plan_node_id = planstate->plan->plan_node_id;
-	SharedPlanStateInstrumentation *ps_instrument;
+	Instrumentation *instrument;
+
+	InstrEndLoop(planstate->instrument);
 
 	/*
 	 * If we shuffled the plan_node_id values in ps_instrument into sorted
 	 * order, we could use binary search here.  This might matter someday
 	 * if we're pushing down sufficiently large plan trees.  For now, do it
 	 * if we're pushing down sufficiently large plan trees.  For now, do it
 	 * the slow, dumb way.
 	 */
-	for (i = 0; i < instrumentation->ps_ninstrument; ++i)
-		if (instrumentation->ps_instrument[i].plan_node_id == plan_node_id)
+	for (i = 0; i < instrumentation->num_plan_nodes; ++i)
+		if (instrumentation->plan_node_id[i] == plan_node_id)
 			break;
-	if (i >= instrumentation->ps_ninstrument)
+	if (i >= instrumentation->num_plan_nodes)
 		elog(ERROR, "plan node %d not found", plan_node_id);
 
 	/*
-	 * There's one SharedPlanStateInstrumentation per plan_node_id, so we
-	 * must use a spinlock in case multiple workers report at the same time.
+	 * Add our statistics to the per-node, per-worker totals.  It's possible
+	 * that this could happen more than once if we relaunched workers.
 	 */
-	ps_instrument = &instrumentation->ps_instrument[i];
-	SpinLockAcquire(&ps_instrument->mutex);
-	InstrAggNode(&ps_instrument->instr, planstate->instrument);
-	SpinLockRelease(&ps_instrument->mutex);
+	instrument = GetInstrumentationArray(instrumentation);
+	instrument += i * instrumentation->num_workers;
+	Assert(IsParallelWorker());
+	Assert(ParallelWorkerNumber < instrumentation->num_workers);
+	InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
 
 	return planstate_tree_walker(planstate, ExecParallelReportInstrumentation,
 								 instrumentation);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f28e56c..52d3c81 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -63,6 +63,12 @@ typedef struct Instrumentation
 	BufferUsage bufusage;		/* Total buffer usage */
 } Instrumentation;
 
+typedef struct WorkerInstrumentation
+{
+	int			num_workers;	/* # of structures that follow */
+	Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerInstrumentation;
+
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 
 extern Instrumentation *InstrAlloc(int n, int instrument_options);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..5ccf470 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1029,6 +1029,7 @@ typedef struct PlanState
 								 * top-level plan */
 
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
+	WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
 
 	/*
 	 * Common structural data for all Plan types.  These links to subsidiary
Attachment: parallel-join-v2.patch (text/x-patch)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 30e6b3d..5bc8eef 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -167,25 +167,25 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	e->nnodes++;
 
 	/* Call estimators for parallel-aware nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanEstimate((SeqScanState *) planstate,
-								e->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanEstimate((SeqScanState *) planstate,
+									e->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
 
 /*
- * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
- * may need to initialize shared state in the DSM before parallel workers
- * are available.  They can allocate the space they previous estimated using
- * shm_toc_allocate, and add the keys they previously estimated using
- * shm_toc_insert, in each case targeting pcxt->toc.
+ * Initialize the dynamic shared memory segment that will be used to control
+ * parallel execution.
  */
 static bool
 ExecParallelInitializeDSM(PlanState *planstate,
@@ -202,15 +202,26 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	/*
+	 * Call initializers for parallel-aware plan nodes.
+	 *
+	 * Ordinary plan nodes won't do anything here, but parallel-aware plan
+	 * nodes may need to initialize shared state in the DSM before parallel
+	 * workers are available.  They can allocate the space they previously
+	 * estimated using shm_toc_allocate, and add the keys they previously
+	 * estimated using shm_toc_insert, in each case targeting pcxt->toc.
+	 */
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeDSM((SeqScanState *) planstate,
-									 d->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
+										 d->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
@@ -623,13 +634,16 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 		return false;
 
 	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 012c14b..9099454 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1588,6 +1588,8 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_BOOL_FIELD(parallel_safe);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1764,7 +1766,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
 
@@ -1887,6 +1888,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 916a518..5019804 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)
 
 
--- bjm & tgl
+Parallel Query and Partial Paths
+--------------------------------
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  Academic literature on parallel query suggests that
+that parallel execution strategies can be divided into essentially two
+categories: pipelined parallelism, where the execution of the query is
+divided into multiple stages and each stage is handled by a separate
+process; and partitioning parallelism, where the data is split between
+multiple processes and each process handles a subset of it.  The
+literature, however, suggests that gains from pipeline parallelism are
+often very limited due to the difficulty of avoiding pipeline stalls.
+Consequently, we do not currently attempt to generate query plans that
+use this technique.
+
+Instead, we focus on partitioning parallelism, which does not require
+that the underlying table be partitioned.  It only requires that (1)
+there is some method of dividing the data from at least one of the base
+tables involved in the query across multiple processes, (2) each process
+can handle its own portion of the data, and (3) the results can then be
+collected.  Requirements (2) and (3) are satisfied by the
+executor node Gather, which launches any number of worker processes and
+executes its single child plan in all of them (and perhaps in the leader
+also, if the children aren't generating enough data to keep the leader
+busy).  Requirement (1) is handled by the SeqScan node: when invoked
+with parallel_aware = true, this node will, in effect, partition the
+table on a block-by-block basis, returning a subset of the tuples from
+the relation in each worker where that SeqScan is executed.  A similar
+scheme could be (and probably should be) implemented for bitmap heap
+scans.
+
+Just as we do for non-parallel access methods, we build Paths to
+represent access strategies that can be used in a parallel plan.  These
+are, in essence, the same strategies that are available in the
+non-parallel plan, but there is an important difference: a path that
+will run beneath a Gather node returns only a subset of the query
+results in each worker, not all of them.  To form a path that can
+actually be executed, the (rather large) cost of the Gather node must be
+accounted for.  For this reason among others, paths intended to run
+beneath a Gather node - which we call "partial" paths since they return
+only a subset of the results in each worker - must be kept separate from
+ordinary paths (see RelOptInfo's partial_pathlist and the function
+add_partial_path).
+
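+As a (purely illustrative) example, gathering a partial path for a join
+could produce a completed plan shaped like this:
+
+    Gather
+      ->  Hash Join                      (partial: run in every worker)
+            ->  Parallel Seq Scan on a   (partial: a subset of a's blocks)
+            ->  Hash
+                  ->  Seq Scan on b      (complete: repeated in every worker)
+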
+One of the keys to making parallel query effective is to run as much of
+the query in parallel as possible.  Therefore, we expect it to generally
+be desirable to postpone the Gather stage until as near to the top of the
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costing it accurately) is likely to keep us
+busy for a long time to come.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 4516cd3..326de0c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -72,6 +72,7 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 				 Index rti, RangeTblEntry *rte);
 static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
 				   RangeTblEntry *rte);
+static void create_parallel_paths(PlannerInfo *root, RelOptInfo *rel);
 static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
 						  RangeTblEntry *rte);
 static bool function_rte_parallel_ok(RangeTblEntry *rte);
@@ -612,7 +613,6 @@ static void
 set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 {
 	Relids		required_outer;
-	int			parallel_threshold = 1000;
 
 	/*
 	 * We don't support pushing join clauses into the quals of a seqscan, but
@@ -624,39 +624,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer, 0));
 
-	/* Consider parallel sequential scan */
-	if (rel->consider_parallel && rel->pages > parallel_threshold &&
-		required_outer == NULL)
-	{
-		Path *path;
-		int parallel_degree = 1;
-
-		/*
-		 * Limit the degree of parallelism logarithmically based on the size
-		 * of the relation.  This probably needs to be a good deal more
-		 * sophisticated, but we need something here for now.
-		 */
-		while (rel->pages > parallel_threshold * 3 &&
-			   parallel_degree < max_parallel_degree)
-		{
-			parallel_degree++;
-			parallel_threshold *= 3;
-			if (parallel_threshold >= PG_INT32_MAX / 3)
-				break;
-		}
-
-		/*
-		 * Ideally we should consider postponing the gather operation until
-		 * much later, after we've pushed joins and so on atop the parallel
-		 * sequential scan path.  But we don't have the infrastructure for
-		 * that yet, so just do this for now.
-		 */
-		path = create_seqscan_path(root, rel, required_outer, parallel_degree);
-		path = (Path *)
-			create_gather_path(root, rel, path, required_outer,
-							   parallel_degree);
-		add_path(rel, path);
-	}
+	/* If appropriate, consider parallel sequential scan */
+	if (rel->consider_parallel && required_outer == NULL)
+		create_parallel_paths(root, rel);
 
 	/* Consider index scans */
 	create_index_paths(root, rel);
@@ -666,6 +636,54 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * create_parallel_paths
+ *	  Build parallel access paths for a plain relation
+ */
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int		parallel_threshold = 1000;
+	int		parallel_degree = 1;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * without doing anything ... unless it's an inheritance child.  In that case,
+	 * we want to generate a parallel path here anyway.  It might not be worthwhile
+	 * just for this relation, but when combined with all of its inheritance siblings
+	 * it may well pay off.
+	 */
+	if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+		return;
+
+	/*
+	 * Limit the degree of parallelism logarithmically based on the size of the
+	 * relation.  This probably needs to be a good deal more sophisticated, but we
+	 * need something here for now.
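+	 * (For example, with the threshold of 1000 pages used here, a relation
+	 * of up to 3000 pages gets one worker, up to 9000 pages two, up to
+	 * 27000 pages three, and so on, subject to max_parallel_degree.)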
+	 */
+	while (rel->pages > parallel_threshold * 3 &&
+		   parallel_degree < max_parallel_degree)
+	{
+		parallel_degree++;
+		parallel_threshold *= 3;
+		if (parallel_threshold >= PG_INT32_MAX / 3)
+			break;
+	}
+
+	/* Add an unordered partial path based on a parallel sequential scan. */
+	add_partial_path(rel, create_seqscan_path(root, rel, NULL, parallel_degree));
+
+	/*
+	 * If this is a baserel, consider gathering any partial paths we may have
+	 * just created.  If we gathered each inheritance child, we could end up
+	 * with a very large number of gather nodes, each trying to grab its own
+	 * pool of workers, so don't do this in that case.  Instead, we'll
+	 * consider gathering partial paths for the appendrel.
+	 */
+	if (rel->reloptkind == RELOPT_BASEREL)
+		generate_gather_paths(root, rel);
+}
+
+/*
  * set_tablesample_rel_size
  *	  Set size estimates for a sampled relation
  */
@@ -1039,6 +1057,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	List	   *live_childrels = NIL;
 	List	   *subpaths = NIL;
 	bool		subpaths_valid = true;
+	List	   *partial_subpaths = NIL;
+	bool		partial_subpaths_valid = true;
 	List	   *all_child_pathkeys = NIL;
 	List	   *all_child_outers = NIL;
 	ListCell   *l;
@@ -1093,6 +1113,13 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		else
 			subpaths_valid = false;
 
+		/* Same idea, but for a partial plan. */
+		if (childrel->partial_pathlist != NIL)
+			partial_subpaths = accumulate_append_subpath(partial_subpaths,
+									   linitial(childrel->partial_pathlist));
+		else
+			partial_subpaths_valid = false;
+
 		/*
 		 * Collect lists of all the available path orderings and
 		 * parameterizations for all the children.  We use these as a
@@ -1164,7 +1191,39 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 * if we have zero or one live subpath due to constraint exclusion.)
 	 */
 	if (subpaths_valid)
-		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL));
+		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL, 0));
+
+	/*
+	 * Consider an append of unordered, unparameterized partial paths.
+	 */
+	if (partial_subpaths_valid)
+	{
+		AppendPath *appendpath;
+		ListCell   *lc;
+		int			parallel_degree = 0;
+
+		/*
+		 * Decide what parallel degree to request for this append path.  For
+		 * now, we just use the maximum parallel degree of any member.  It
+		 * might be useful to use a higher number if the Append node were
+		 * smart enough to spread out the workers, but it currently isn't.
+		 */
+		foreach(lc, partial_subpaths)
+		{
+			Path	   *path = lfirst(lc);
+
+			parallel_degree = Max(parallel_degree, path->parallel_degree);
+		}
+		Assert(parallel_degree > 0);
+
+		/* Generate a partial append path. */
+		appendpath = create_append_path(rel, partial_subpaths, NULL,
+										parallel_degree);
+		add_partial_path(rel, (Path *) appendpath);
+
+		/* Consider gathering it. */
+		generate_gather_paths(root, rel);
+	}
 
 	/*
 	 * Also build unparameterized MergeAppend paths based on the collected
@@ -1214,7 +1273,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 		if (subpaths_valid)
 			add_path(rel, (Path *)
-					 create_append_path(rel, subpaths, required_outer));
+					 create_append_path(rel, subpaths, required_outer, 0));
 	}
 }
 
@@ -1440,8 +1499,9 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 
 	/* Discard any pre-existing paths; no further need for them */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/*
 	 * We set the cheapest path immediately, to ensure that IS_DUMMY_REL()
@@ -1844,6 +1904,36 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * generate_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather on
+ *		top of a partial path.
+ */
+void
+generate_gather_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	Path	   *cheapest_partial_path;
+	Path	   *simple_gather_path;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/*
+	 * The output of Gather is currently always unsorted, so there's only one
+	 * partial path of interest: the cheapest one.
+	 *
+	 * Eventually, we should have a Gather Merge operation that can merge
+	 * multiple tuple streams together while preserving their ordering.  We
+	 * could usefully generate such a path from each partial path that has
+	 * non-NIL pathkeys.
+	 */
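+	/* add_partial_path keeps the list sorted by cost, so this is cheapest. */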
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+	simple_gather_path = (Path *)
+		create_gather_path(root, rel, cheapest_partial_path, NULL);
+	add_path(rel, simple_gather_path);
+}
+
+/*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..1e76c03 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -186,24 +186,34 @@ clamp_row_est(double nrows)
  */
 void
 cost_seqscan(Path *path, PlannerInfo *root,
-			 RelOptInfo *baserel, ParamPathInfo *param_info,
-			 int nworkers)
+			 RelOptInfo *baserel, ParamPathInfo *param_info)
 {
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
 	double		spc_seq_page_cost;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
+	double		parallel_divisor = 1;
 
 	/* Should only be applied to base relations */
 	Assert(baserel->relid > 0);
 	Assert(baserel->rtekind == RTE_RELATION);
 
+	/*
+	 * Primitive parallel cost model.  Assume the leader will do half as much
+	 * work as a regular worker, because it will also need to read the tuples
+	 * returned by the workers when they percolate up to the gather node.
+	 * This is almost certainly not exactly the right way to model this, so
+	 * this will probably need to be changed at some point...
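+	 * (For example, with parallel_degree = 2 the divisor is 2.5: each of
+	 * the two workers is assumed to handle 40% of the rows and pages, and
+	 * the leader the remaining 20%.)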
+	 */
+	if (path->parallel_degree > 0)
+		parallel_divisor = path->parallel_degree + 0.5;
+
 	/* Mark the path with the correct row estimate */
 	if (param_info)
-		path->rows = param_info->ppi_rows;
+		path->rows = param_info->ppi_rows / parallel_divisor;
 	else
-		path->rows = baserel->rows;
+		path->rows = baserel->rows / parallel_divisor;
 
 	if (!enable_seqscan)
 		startup_cost += disable_cost;
@@ -216,24 +226,14 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	/*
 	 * disk costs
 	 */
-	run_cost += spc_seq_page_cost * baserel->pages;
+	run_cost += spc_seq_page_cost * baserel->pages / parallel_divisor;
 
 	/* CPU costs */
 	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
 
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
-	run_cost += cpu_per_tuple * baserel->tuples;
-
-	/*
-	 * Primitive parallel cost model.  Assume the leader will do half as much
-	 * work as a regular worker, because it will also need to read the tuples
-	 * returned by the workers when they percolate up to the gather ndoe.
-	 * This is almost certainly not exactly the right way to model this, so
-	 * this will probably need to be changed at some point...
-	 */
-	if (nworkers > 0)
-		run_cost = run_cost / (nworkers + 0.5);
+	run_cost += cpu_per_tuple * baserel->tuples / parallel_divisor;
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index a35c881..0c07622 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -34,6 +34,12 @@ static void sort_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 static void match_unsorted_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
+static void consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra);
 static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
@@ -263,7 +269,12 @@ add_paths_to_joinrel(PlannerInfo *root,
 												 jointype, &extra);
 
 	/*
-	 * 6. Finally, give extensions a chance to manipulate the path list.
+	 * 6. Consider gathering partial paths.
+	 */
+	generate_gather_paths(root, joinrel);
+
+	/*
+	 * 7. Finally, give extensions a chance to manipulate the path list.
 	 */
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
@@ -428,6 +439,62 @@ try_nestloop_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_nestloop_path
+ *	  Consider a partial nestloop join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_nestloop_path(PlannerInfo *root,
+				  RelOptInfo *joinrel,
+				  Path *outer_path,
+				  Path *inner_path,
+				  List *pathkeys,
+				  JoinType jointype,
+				  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * If the inner path is parameterized, the parameterization must be fully
+	 * satisfied by the proposed outer path.  Parameterized partial paths are
+	 * not supported.  The caller should already have verified that no
+	 * extra_lateral_rels are required here.
+	 */
+	Assert(bms_is_empty(extra->extra_lateral_rels));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_nestloop(root, &workspace, jointype,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_nestloop_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  pathkeys,
+								  NULL));
+}
+
+/*
  * try_mergejoin_path
  *	  Consider a merge join path; if it appears useful, push it into
  *	  the joinrel's pathlist via add_path().
@@ -582,6 +649,62 @@ try_hashjoin_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_hashjoin_path
+ *	  Consider a partial hashjoin join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_hashjoin_path(PlannerInfo *root,
+						  RelOptInfo *joinrel,
+						  Path *outer_path,
+						  Path *inner_path,
+						  List *hashclauses,
+						  JoinType jointype,
+						  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * A hash join cannot pass parameter values from its outer side to its
+	 * inner side, so an inner path parameterized in any way would leave the
+	 * join path parameterized, and parameterized partial paths are not
+	 * supported.  Hence, reject any parameterized inner path.  The caller
+	 * should already have verified that no extra_lateral_rels are required
+	 * here.
+	 */
+	Assert(bms_is_empty(extra->extra_lateral_rels));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_empty(inner_paramrels))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, NIL))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_hashjoin_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  NULL,
+								  hashclauses));
+}
+
+/*
  * clause_sides_match_join
  *	  Determine whether a join clause is of the right form to use in this join.
  *
@@ -1173,6 +1296,85 @@ match_unsorted_outer(PlannerInfo *root,
 				break;
 		}
 	}
+
+	/*
+	 * If the joinrel is parallel-safe and the join type supports nested loops,
+	 * we may be able to consider a partial nestloop plan.  However, we can't
+	 * handle JOIN_UNIQUE_OUTER, because the outer path will be partial, and
+	 * therefore we won't be able to properly guarantee uniqueness.  Nor can
+	 * we handle extra_lateral_rels, since partial paths must not be
+	 * parameterized.
+	 */
+	if (joinrel->consider_parallel && nestjoinOK &&
+		save_jointype != JOIN_UNIQUE_OUTER &&
+		bms_is_empty(extra->extra_lateral_rels))
+		consider_parallel_nestloop(root, joinrel, outerrel, innerrel,
+								   save_jointype, extra);
+}
+
+/*
+ * consider_parallel_nestloop
+ *	  Try to build partial paths for a joinrel by joining a partial path for the
+ *	  outer relation to a complete path for the inner relation.
+ *
+ * 'joinrel' is the join relation
+ * 'outerrel' is the outer join relation
+ * 'innerrel' is the inner join relation
+ * 'jointype' is the type of join to do
+ * 'extra' contains additional input values
+ */
+static void
+consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra)
+{
+	ListCell   *lc1;
+
+	foreach(lc1, outerrel->partial_pathlist)
+	{
+		Path	   *outerpath = (Path *) lfirst(lc1);
+		List	   *pathkeys;
+		ListCell   *lc2;
+
+		/* Figure out what useful ordering any paths we create will have. */
+		pathkeys = build_join_pathkeys(root, joinrel, jointype,
+									   outerpath->pathkeys);
+
+		/*
+		 * Try the cheapest parameterized paths; only those which will
+		 * produce an unparameterized path when joined to this outerrel
+		 * will survive try_partial_nestloop_path.  The cheapest
+		 * unparameterized path is also in this list.
+		 */
+		foreach(lc2, innerrel->cheapest_parameterized_paths)
+		{
+			Path	   *innerpath = (Path *) lfirst(lc2);
+
+			/* Can't join to an inner path that is not parallel-safe */
+			if (!innerpath->parallel_safe)
+				continue;
+
+			/*
+			 * Like match_unsorted_outer, we only consider a single nestloop
+			 * path when the jointype is JOIN_UNIQUE_INNER.  But we have to scan
+			 * cheapest_parameterized_paths to find the one we want to consider,
+			 * because cheapest_total_path might not be parallel-safe.
+			 */
+			if (jointype == JOIN_UNIQUE_INNER)
+			{
+				if (!bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					continue;
+				innerpath = (Path *) create_unique_path(root, innerrel,
+											   innerpath, extra->sjinfo);
+			}
+
+			try_partial_nestloop_path(root, joinrel, outerpath, innerpath,
+									  pathkeys, jointype, extra);
+		}
+	}
 }
 
 /*
@@ -1350,6 +1552,55 @@ hash_inner_and_outer(PlannerInfo *root,
 				}
 			}
 		}
+
+		/*
+		 * If the joinrel is parallel-safe, we may be able to consider a
+		 * partial hash join.  However, we can't handle JOIN_UNIQUE_OUTER,
+		 * because the outer path will be partial, and therefore we won't be
+		 * able to properly guarantee uniqueness.  Also, the resulting path
+		 * must not be parameterized.
+		 */
+		if (joinrel->consider_parallel && jointype != JOIN_UNIQUE_OUTER &&
+			outerrel->partial_pathlist != NIL &&
+			bms_is_empty(extra->extra_lateral_rels))
+		{
+			Path   *cheapest_partial_outer;
+			Path   *cheapest_safe_inner = NULL;
+
+			cheapest_partial_outer =
+				(Path *) linitial(outerrel->partial_pathlist);
+
+			/*
+			 * Normally, given that the joinrel is parallel-safe, the cheapest
+			 * total inner path will also be parallel-safe, but if not, we'll
+			 * have to search cheapest_parameterized_paths for the cheapest
+			 * unparameterized parallel-safe inner path.
+			 */
+			if (cheapest_total_inner->parallel_safe)
+				cheapest_safe_inner = cheapest_total_inner;
+			else
+			{
+				ListCell   *lc;
+
+				foreach(lc, innerrel->cheapest_parameterized_paths)
+				{
+					Path	   *innerpath = (Path *) lfirst(lc);
+
+					if (innerpath->parallel_safe &&
+						bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					{
+						cheapest_safe_inner = innerpath;
+						break;
+					}
+				}
+			}
+
+			if (cheapest_safe_inner != NULL)
+				try_partial_hashjoin_path(root, joinrel,
+										  cheapest_partial_outer,
+										  cheapest_safe_inner,
+										  hashclauses, jointype, extra);
+		}
 	}
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index b2cc9f0..9b2b0b4 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1069,9 +1069,10 @@ mark_dummy_rel(RelOptInfo *rel)
 
 	/* Evict any previously chosen paths */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/* Set or update cheapest_total_path and related fields */
 	set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 411b36c..95d95f1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1125,7 +1125,7 @@ create_gather_plan(PlannerInfo *root, GatherPath *best_path)
 
 	gather_plan = make_gather(subplan->targetlist,
 							  NIL,
-							  best_path->num_workers,
+							  best_path->path.parallel_degree,
 							  best_path->single_copy,
 							  subplan);
 
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index d73e7c0..e9f0538 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -84,7 +84,8 @@ query_planner(PlannerInfo *root, List *tlist,
 
 		/* The only path for it is a trivial Result path */
 		add_path(final_rel, (Path *)
-				 create_result_path((List *) parse->jointree->quals));
+				 create_result_path(final_rel,
+									(List *) parse->jointree->quals));
 
 		/* Select cheapest path (pretty easy in this case...) */
 		set_cheapest(final_rel);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 09c3244..6d8483e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -217,7 +217,12 @@ compare_path_costs_fuzzily(Path *path1, Path *path2, double fuzz_factor)
  * The cheapest_parameterized_paths list collects all parameterized paths
  * that have survived the add_path() tournament for this relation.  (Since
  * add_path ignores pathkeys for a parameterized path, these will be paths
- * that have best cost or best row count for their parameterization.)
+ * that have best cost or best row count for their parameterization.  We
+ * may also have both a parallel-safe and a non-parallel-safe path for the
+ * same parameterization in some cases, but this should be relatively rare,
+ * since typically either all paths for a given relation will be
+ * parallel-safe or none of them will.)
+ *
  * cheapest_parameterized_paths always includes the cheapest-total
  * unparameterized path, too, if there is one; the users of that list find
  * it more convenient if that's included.
@@ -352,11 +357,12 @@ set_cheapest(RelOptInfo *parent_rel)
  *	  A path is worthy if it has a better sort order (better pathkeys) or
  *	  cheaper cost (on either dimension), or generates fewer rows, than any
  *	  existing path that has the same or superset parameterization rels.
+ *	  We also consider parallel-safe paths more worthy than others.
  *
  *	  We also remove from the rel's pathlist any old paths that are dominated
  *	  by new_path --- that is, new_path is cheaper, at least as well ordered,
- *	  generates no more rows, and requires no outer rels not required by the
- *	  old path.
+ *	  generates no more rows, requires no outer rels not required by the old
+ *	  path, and is no less parallel-safe.
  *
  *	  In most cases, a path with a superset parameterization will generate
  *	  fewer rows (since it has more join clauses to apply), so that those two
@@ -431,7 +437,6 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 		PathCostComparison costcmp;
 		PathKeysComparison keyscmp;
 		BMS_Comparison outercmp;
-
 		p1_next = lnext(p1);
 
 		/*
@@ -470,14 +475,16 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						else if (keyscmp == PATHKEYS_BETTER2)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						else	/* keyscmp == PATHKEYS_EQUAL */
@@ -487,10 +494,10 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 								/*
 								 * Same pathkeys and outer rels, and fuzzily
 								 * the same cost, so keep just one; to decide
-								 * which, first check rows and then do a fuzzy
-								 * cost comparison with very small fuzz limit.
-								 * (We used to do an exact cost comparison,
-								 * but that results in annoying
+								 * which, first check parallel-safety, then rows,
+								 * then do a fuzzy cost comparison with very small
+								 * fuzz limit.  (We used to do an exact cost
+								 * comparison, but that results in annoying
 								 * platform-specific plan variations due to
 								 * roundoff in the cost estimates.)  If things
 								 * are still tied, arbitrarily keep only the
@@ -499,7 +506,13 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 								 * comparison decides the startup and total
 								 * costs compare differently.
 								 */
-								if (new_path->rows < old_path->rows)
+								if (new_path->parallel_safe >
+									old_path->parallel_safe)
+									remove_old = true;	/* new dominates old */
+								else if (new_path->parallel_safe <
+										 old_path->parallel_safe)
+									accept_new = false;	/* old dominates new */
+								else if (new_path->rows < old_path->rows)
 									remove_old = true;	/* new dominates old */
 								else if (new_path->rows > old_path->rows)
 									accept_new = false; /* old dominates new */
@@ -512,10 +525,12 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 														 * dominates new */
 							}
 							else if (outercmp == BMS_SUBSET1 &&
-									 new_path->rows <= old_path->rows)
+									 new_path->rows <= old_path->rows &&
+								 new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 							else if (outercmp == BMS_SUBSET2 &&
-									 new_path->rows >= old_path->rows)
+									 new_path->rows >= old_path->rows &&
+								 new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 							/* else different parameterizations, keep both */
 						}
@@ -527,7 +542,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						break;
@@ -538,7 +554,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						break;
@@ -685,6 +702,215 @@ add_path_precheck(RelOptInfo *parent_rel,
 	return true;
 }
 
+/*
+ * add_partial_path
+ *	  Like add_path, our goal here is to consider whether a path is worthy
+ *	  of being kept around, but the considerations here are a bit different.
+ *	  A partial path is one which can be executed in any number of workers in
+ *	  parallel such that each worker will generate a subset of the path's
+ *	  overall result.
+ *
+ *	  We don't generate parameterized partial paths for several reasons.  Most
+ *	  importantly, they're not safe to execute, because there's nothing to
+ *	  make sure that a parallel scan within the parameterized portion of the
+ *	  plan is running with the same value in every worker at the same time.
+ *	  Fortunately, it seems unlikely to be worthwhile anyway, because having
+ *	  each worker scan the entire outer relation and a subset of the inner
+ *	  relation will generally be a terrible plan.  The inner (parameterized)
+ *	  side of the plan will be small anyway.  There could be rare cases where
+ *	  this wins big - e.g. if join order constraints put a 1-row relation on
+ *	  the outer side of the topmost join with a parameterized plan on the inner
+ *	  side - but we'll have to be content not to handle such cases until somebody
+ *	  builds an executor infrastructure that can cope with them.
+ *
+ *	  Because we don't consider parameterized paths here, we also don't
+ *	  need to consider the row counts as a measure of quality: every path will
+ *	  produce the same number of rows.  Neither do we need to consider startup
+ *	  costs: parallelism is only used for plans that will be run to completion.
+ *	  Therefore, this routine is much simpler than add_path: it needs to
+ *	  consider only pathkeys and total cost.
+ */
+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	bool		accept_new = true;		/* unless we find a superior old path */
+	ListCell   *insert_after = NULL;	/* where to insert new item */
+	ListCell   *p1;
+	ListCell   *p1_prev;
+	ListCell   *p1_next;
+
+	/* Check for query cancel. */
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * As in add_path, throw out any paths which are dominated by the new path,
+	 * but throw out the new path if some existing path dominates it.
+	 */
+	p1_prev = NULL;
+	for (p1 = list_head(parent_rel->partial_pathlist); p1 != NULL;
+		 p1 = p1_next)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		bool		remove_old = false; /* unless new proves superior */
+		PathKeysComparison keyscmp;
+
+		p1_next = lnext(p1);
+
+		/* Compare pathkeys. */
+		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
+
+		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
+			{
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
+					accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost
+						* STD_FUZZ_FACTOR)
+			{
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
+					remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
+			}
+		}
+
+		/*
+		 * Remove current element from partial_pathlist if dominated by new.
+		 */
+		if (remove_old)
+		{
+			parent_rel->partial_pathlist =
+				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
+			/* add_path has a special case for IndexPath; we don't need it */
+			Assert(!IsA(old_path, IndexPath));
+			pfree(old_path);
+			/* p1_prev does not advance */
+		}
+		else
+		{
+			/* new belongs after this old path if it has cost >= old's */
+			if (new_path->total_cost >= old_path->total_cost)
+				insert_after = p1;
+			/* p1_prev advances */
+			p1_prev = p1;
+		}
+
+		/*
+		 * If we found an old path that dominates new_path, we can quit
+		 * scanning the partial_pathlist; we will not add new_path, and we
+		 * assume new_path cannot dominate any later path.
+		 */
+		if (!accept_new)
+			break;
+	}
+
+	if (accept_new)
+	{
+		/* Accept the new path: insert it at proper place */
+		if (insert_after)
+			lappend_cell(parent_rel->partial_pathlist, insert_after, new_path);
+		else
+			parent_rel->partial_pathlist =
+				lcons(new_path, parent_rel->partial_pathlist);
+	}
+	else
+	{
+		/* add_path has a special case for IndexPath; we don't need it */
+		Assert(!IsA(new_path, IndexPath));
+		/* Reject and recycle the new path */
+		pfree(new_path);
+	}
+}
+
+/*
+ * add_partial_path_precheck
+ *	  Check whether a proposed new partial path could possibly get accepted.
+ *
+ * Unlike add_path_precheck, we can ignore startup cost and parameterization,
+ * since they don't matter for partial paths (see add_partial_path).  But
+ * we do want to make sure we don't add a partial path if there's already
+ * a complete path that dominates it, since in that case the proposed path
+ * is surely a loser.
+ */
+bool
+add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost,
+						  List *pathkeys)
+{
+	ListCell   *p1;
+
+	/*
+	 * Our goal here is twofold.  First, we want to find out whether this
+	 * path is clearly inferior to some existing partial path.  If so, we want
+	 * to reject it immediately.  Second, we want to find out whether this
+	 * path is clearly superior to some existing partial path -- at least,
+	 * modulo final cost computations.  If so, we definitely want to consider
+	 * it.
+	 *
+	 * Unlike add_path(), we always compare pathkeys here.  This is because
+	 * we expect partial_pathlist to be very short, and getting a definitive
+	 * answer at this stage avoids the need to call add_path_precheck.
+	 */
+	foreach(p1, parent_rel->partial_pathlist)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		PathKeysComparison keyscmp;
+
+		keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER1)
+				return false;
+			if (old_path->total_cost > total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER2)
+				return true;
+		}
+	}
+
+	/*
+	 * This path is neither clearly inferior to an existing partial path
+	 * nor clearly good enough that it might replace one.  Compare it to
+	 * non-parallel plans.  If it loses even before accounting for the cost
+	 * of the Gather node, we should definitely reject it.
+	 *
+	 * Note that we pass the total_cost to add_path_precheck twice.  This is
+	 * because it's never advantageous to consider the startup cost of a
+	 * partial path; the resulting plans, if run in parallel, will be run to
+	 * completion.
+	 */
+	if (!add_path_precheck(parent_rel, total_cost, total_cost, pathkeys,
+						   NULL))
+		return false;
+
+	return true;
+}
+
 
 /*****************************************************************************
  *		PATH NODE CREATION ROUTINES
@@ -697,7 +923,7 @@ add_path_precheck(RelOptInfo *parent_rel,
  */
 Path *
 create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers)
+					Relids required_outer, int parallel_degree)
 {
 	Path	   *pathnode = makeNode(Path);
 
@@ -705,10 +931,12 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->parent = rel;
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
-	pathnode->parallel_aware = nworkers > 0 ? true : false;
+	pathnode->parallel_aware = parallel_degree > 0 ? true : false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = parallel_degree;
 	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
 
-	cost_seqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
 
 	return pathnode;
 }
@@ -727,6 +955,8 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* samplescan has unordered result */
 
 	cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -781,6 +1011,8 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
@@ -827,6 +1059,8 @@ create_bitmap_heap_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapqual = bitmapqual;
@@ -853,6 +1087,8 @@ create_bitmap_and_path(PlannerInfo *root,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -878,6 +1114,8 @@ create_bitmap_or_path(PlannerInfo *root,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -903,6 +1141,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->tidquals = tidquals;
@@ -921,7 +1161,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
  * Note that we must handle subpaths = NIL, representing a dummy access path.
  */
 AppendPath *
-create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
+create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer,
+				   int parallel_degree)
 {
 	AppendPath *pathnode = makeNode(AppendPath);
 	ListCell   *l;
@@ -931,6 +1172,8 @@ create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* result is always considered
 										 * unsorted */
 	pathnode->subpaths = subpaths;
@@ -985,6 +1228,8 @@ create_merge_append_path(PlannerInfo *root,
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->subpaths = subpaths;
 
@@ -1052,7 +1297,7 @@ create_merge_append_path(PlannerInfo *root,
  *	  This is only used for the case of a query with an empty jointree.
  */
 ResultPath *
-create_result_path(List *quals)
+create_result_path(RelOptInfo *rel, List *quals)
 {
 	ResultPath *pathnode = makeNode(ResultPath);
 
@@ -1060,6 +1305,8 @@ create_result_path(List *quals)
 	pathnode->path.parent = NULL;
 	pathnode->path.param_info = NULL;	/* there are no other rels... */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;
 	pathnode->quals = quals;
 
@@ -1094,6 +1341,8 @@ create_material_path(RelOptInfo *rel, Path *subpath)
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = subpath->pathkeys;
 
 	pathnode->subpath = subpath;
@@ -1155,6 +1404,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 
 	/*
 	 * Assume the output is unsorted, since we don't necessarily have pathkeys
@@ -1328,19 +1579,30 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
  */
 GatherPath *
 create_gather_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
-				   Relids required_outer, int nworkers)
+				   Relids required_outer)
 {
 	GatherPath *pathnode = makeNode(GatherPath);
 
+	Assert(subpath->parallel_safe);
+
 	pathnode->path.pathtype = T_Gather;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = false;
+	pathnode->path.parallel_degree = subpath->parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* Gather has unordered result */
 
 	pathnode->subpath = subpath;
-	pathnode->num_workers = nworkers;
+	pathnode->single_copy = false;
+
+	if (pathnode->path.parallel_degree == 0)
+	{
+		pathnode->path.parallel_degree = 1;
+		pathnode->path.pathkeys = subpath->pathkeys;
+		pathnode->single_copy = true;
+	}
 
 	cost_gather(pathnode, root, rel, pathnode->path.param_info);
 
@@ -1393,6 +1655,8 @@ create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_subqueryscan(pathnode, root, rel, pathnode->param_info);
@@ -1416,6 +1680,8 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1439,6 +1705,8 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -1461,6 +1729,8 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* XXX for now, result is always unordered */
 
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -1484,6 +1754,8 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	/* Cost is the same as for a regular CTE scan */
@@ -1516,6 +1788,8 @@ create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.rows = rows;
 	pathnode->path.startup_cost = startup_cost;
 	pathnode->path.total_cost = total_cost;
@@ -1651,6 +1925,9 @@ create_nestloop_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = joinrel->consider_parallel;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->path.parallel_degree = outer_path->parallel_degree;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->jointype = jointype;
 	pathnode->outerjoinpath = outer_path;
@@ -1709,6 +1986,8 @@ create_mergejoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel;
+	pathnode->jpath.path.parallel_degree = 0;
 	pathnode->jpath.path.pathkeys = pathkeys;
 	pathnode->jpath.jointype = jointype;
 	pathnode->jpath.outerjoinpath = outer_path;
@@ -1766,6 +2045,9 @@ create_hashjoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->jpath.path.parallel_degree = outer_path->parallel_degree;
 
 	/*
 	 * A hashjoin never has pathkeys, since its output ordering is
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 996b7fe..8d7ac48 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -107,6 +107,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	rel->reltargetlist = NIL;
 	rel->pathlist = NIL;
 	rel->ppilist = NIL;
+	rel->partial_pathlist = NIL;
 	rel->cheapest_startup_path = NULL;
 	rel->cheapest_total_path = NULL;
 	rel->cheapest_unique_path = NULL;
@@ -369,6 +370,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->reltargetlist = NIL;
 	joinrel->pathlist = NIL;
 	joinrel->ppilist = NIL;
+	joinrel->partial_pathlist = NIL;
 	joinrel->cheapest_startup_path = NULL;
 	joinrel->cheapest_total_path = NULL;
 	joinrel->cheapest_unique_path = NULL;
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9a0dd28..3ebb9dc 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -458,6 +458,7 @@ typedef struct RelOptInfo
 	List	   *reltargetlist;	/* Vars to be output by scan of relation */
 	List	   *pathlist;		/* Path structures */
 	List	   *ppilist;		/* ParamPathInfos used in pathlist */
+	List	   *partial_pathlist;	/* partial Paths */
 	struct Path *cheapest_startup_path;
 	struct Path *cheapest_total_path;
 	struct Path *cheapest_unique_path;
@@ -755,6 +756,8 @@ typedef struct Path
 	RelOptInfo *parent;			/* the relation this path can build */
 	ParamPathInfo *param_info;	/* parameterization info, or NULL if none */
 	bool		parallel_aware; /* engage parallel-aware logic? */
+	bool		parallel_safe;	/* OK to use as part of parallel plan? */
+	int			parallel_degree; /* desired parallel degree; 0 = not parallel */
 
 	/* estimated size/costs for path (see costsize.c for more info) */
 	double		rows;			/* estimated number of result tuples */
@@ -1057,7 +1060,6 @@ typedef struct GatherPath
 {
 	Path		path;
 	Path	   *subpath;		/* path for each worker */
-	int			num_workers;	/* number of workers sought to help */
 	bool		single_copy;	/* path must not be executed >1x */
 } GatherPath;
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac21a3a..25a7303 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -72,7 +72,7 @@ extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
-			 ParamPathInfo *param_info, int nworkers);
+			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index f28b4e2..93e0e4e 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -29,9 +29,12 @@ extern void add_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_path_precheck(RelOptInfo *parent_rel,
 				  Cost startup_cost, Cost total_cost,
 				  List *pathkeys, Relids required_outer);
+extern void add_partial_path(RelOptInfo *parent_rel, Path *new_path);
+extern bool add_partial_path_precheck(RelOptInfo *parent_rel,
+						  Cost total_cost, List *pathkeys);
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers);
+					Relids required_outer, int parallel_degree);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
@@ -59,19 +62,18 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths,
-				   Relids required_outer);
+				   Relids required_outer, int parallel_degree);
 extern MergeAppendPath *create_merge_append_path(PlannerInfo *root,
 						 RelOptInfo *rel,
 						 List *subpaths,
 						 List *pathkeys,
 						 Relids required_outer);
-extern ResultPath *create_result_path(List *quals);
+extern ResultPath *create_result_path(RelOptInfo *rel, List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 				   Path *subpath, SpecialJoinInfo *sjinfo);
 extern GatherPath *create_gather_path(PlannerInfo *root,
-				   RelOptInfo *rel, Path *subpath, Relids required_outer,
-				   int nworkers);
+				   RelOptInfo *rel, Path *subpath, Relids required_outer);
 extern Path *create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 						 List *pathkeys, Relids required_outer);
 extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..0007c28 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -50,6 +50,8 @@ extern RelOptInfo *make_one_rel(PlannerInfo *root, List *joinlist);
 extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
+extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
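
To make the intended call pattern concrete, here is a minimal sketch of
how the new entry points compose (an illustration with the usual
allpaths.c planner context assumed, not an excerpt from the patch): a
scan path built with a nonzero parallel_degree goes into
partial_pathlist via add_partial_path(), and the cheapest partial path
is later topped with a Gather, which turns it into an ordinary complete
path that can compete in the regular pathlist.

/*
 * Sketch only: build a partial seqscan path for rel, then complete the
 * cheapest partial path with a Gather node.  Degree selection and the
 * parallel-safety checks that gate all of this are elided.
 */
static void
sketch_parallel_seqscan(PlannerInfo *root, RelOptInfo *rel,
						int parallel_degree)
{
	/* A partial path: each worker produces a subset of the result. */
	add_partial_path(rel,
					 create_seqscan_path(root, rel, NULL, parallel_degree));

	/* partial_pathlist is kept sorted by total_cost, cheapest first. */
	if (rel->partial_pathlist != NIL)
	{
		Path	   *cheapest_partial = (Path *) linitial(rel->partial_pathlist);

		add_path(rel, (Path *)
				 create_gather_path(root, rel, cheapest_partial, NULL));
	}
}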
#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#8)
1 attachment(s)
Re: parallel joins, and better parallel explain

On Thu, Dec 3, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 1, 2015 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

- There seems to be some inconsistency in Explain's output when
multiple workers are used.

So the net result of this is that the times and row counts are
*averages* across all of the loop iterations. In the case of the
inner side of a nested loop, this means you can read the data just as
you would in a non-parallel plan. For nodes executed exactly once per
worker plus once in the master, the value displayed ends up being a
per-process average of the amount of time spent, and a per-process
average of the number of rows. On the other hand, values for buffers
are NOT divided by the loop count, so those values are absolute
totals. Once you understand this, I think the data is pretty easy to
read.

Okay, I am able to understand the results now.
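
To make that rule concrete with the plan below: the Parallel Seq Scan
reports rows=2000 with loops=5, i.e. roughly 2000 * 5 = 10000 rows in
total across the leader and four workers, which matches the rows=9999
the Gather node reports; the Buffers lines, by contrast, are absolute
totals and need no such multiplication.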

Currently, if the node doesn't get executed for any reason, we use the
below code to display the required information:

else if (es->analyze)
{
    if (es->format == EXPLAIN_FORMAT_TEXT)
        appendStringInfoString(es->str, " (never executed)");
    else
    {
        if (es->timing)
        {
            ExplainPropertyFloat("Actual Startup Time", 0.0, 3, es);
            ExplainPropertyFloat("Actual Total Time", 0.0, 3, es);
        }
        ExplainPropertyFloat("Actual Rows", 0.0, 0, es);
        ExplainPropertyFloat("Actual Loops", 0.0, 0, es);
    }
}

Do you think it would be useful to display something similar when a
worker is not able to execute the plan (for example, when the other
workers have already finished the work before it even starts execution)?

Other than that parallel-explain-v2.patch looks good.

-> Gather (cost=1000.00..46203.83 rows=9579 width=0) (actual time=33.983..33592.030 rows=9999 loops=1)
     Output: c1, c2
     Number of Workers: 4
     Buffers: shared hit=548 read=142506
     -> Parallel Seq Scan on public.tbl_parallel_test (cost=0.00..44245.93 rows=2129 width=0) (actual time=13.447..33354.099 rows=2000 loops=5)
          Output: c1, c2
          Filter: (tbl_parallel_test.c1 < 10000)
          Rows Removed by Filter: 198000
          Buffers: shared hit=352 read=142506
          Worker 0: actual time=18.422..33322.132 rows=2170 loops=1
            Buffers: shared hit=4 read=30765
          Worker 1: actual time=0.803..33283.979 rows=1890 loops=1
            Buffers: shared hit=1 read=26679
          Worker 2: actual time=0.711..33360.007 rows=1946 loops=1
            Buffers: shared hit=197 read=30899
          Worker 3: actual time=15.057..33252.605 rows=2145 loops=1
            Buffers: shared hit=145 read=25433
Planning time: 0.217 ms
Execution time: 33612.964 ms
(22 rows)
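
As a cross-check of the scan-level numbers: the per-worker Buffers lines
sum to hit = 4 + 1 + 197 + 145 = 347 and read = 30765 + 26679 + 30899 +
25433 = 113776, and adding the leader's share (hit=5, read=28730) gives
exactly the hit=352 read=142506 shown at the Parallel Seq Scan node. The
Gather-level numbers are where it gets confusing: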

I am not able to understand how the buffer usage adds up to what is
shown at the Gather node.

It doesn't, of course. But I'm not sure it should,

I think the problem is at the Gather node: the number of buffers (read +
hit) is greater than the number of pages in the relation. The reason is
that in the workers (ParallelQueryMain()), buffer usage accumulation
starts before ExecutorStart(), whereas the master backend always accounts
for it only during the ExecutorRun() phase. On changing the workers to
accumulate only for the ExecutorRun() phase, the above problem is fixed.
The attached patch fixes the problem.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_parallel_bufusage_v1.patchapplication/octet-stream; name=fix_parallel_bufusage_v1.patchDownload
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 6730037..8002004 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -650,19 +650,21 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 		instrument_options = instrumentation->instrument_options;
 	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
 
-	/* Prepare to track buffer usage during query execution. */
-	InstrStartParallelQuery();
-
 	/* Start up the executor, have it run the plan, and then shut it down. */
 	ExecutorStart(queryDesc, 0);
+
 	ExecParallelInitializeWorker(queryDesc->planstate, toc);
+	/* Prepare to track buffer usage during query execution. */
+	InstrStartParallelQuery();
+
 	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
-	ExecutorFinish(queryDesc);
 
 	/* Report buffer usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE);
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber]);
 
+	ExecutorFinish(queryDesc);
+
 	/* Report instrumentation data if any instrumentation options are set. */
 	if (instrumentation != NULL)
 		ExecParallelReportInstrumentation(queryDesc->planstate,
#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#3)
Re: parallel joins, and better parallel explain

On 30 November 2015 at 17:52, Robert Haas <robertmhaas@gmail.com> wrote:

My idea is that you'd end up with a plan like this:

Gather
  -> Hash Join
       -> Parallel Seq Scan
       -> Parallel Hash
            -> Parallel Seq Scan

Not only does this build only one copy of the hash table instead of N
copies, but we can parallelize the hash table construction itself by
having all workers insert in parallel, which is pretty cool.

Hmm. If the hash table is small, it should be OK to keep it locally. If
it's larger, we need the shared copy. Seems like we can do the math on
when to use each kind of hash table... build it in shmem and then copy
locally if needed.

Another way might be to force the hash table into N batches, then give each
scan the task of handling one batch. That would allow a much larger hash
table to still be kept locally, moving the problem towards routing the data.

I'm not immediately convinced by the coolness of loading the hash table in
parallel. A whole class of bugs could be avoided if we choose not to, plus
the hash table is frequently so small a part of the HJ that it's not going
to gain us very much.

The other way to look at what you've said here is that you don't seem to
have a way of building the hash table in only one process, which concerns
me.

What I can confirm at this point is that I've thought about the problem
you're asking about here, and that EnterpriseDB intends to contribute
code to address it.

Good

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#9)
Re: parallel joins, and better parallel explain

On Fri, Dec 4, 2015 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Do you think it would be useful to display something similar when a
worker is not able to execute the plan (for example, when the other
workers have already finished the work before it even starts execution)?

Maybe, but it would clutter the output a good deal. I think we should
instead have the Gather node itself display the number of workers that
it actually managed to launch, and then the user can infer that any
execution nodes that don't mention those workers were not executed by
them.

Other than that parallel-explain-v2.patch looks good.

OK, I'm going to go ahead and commit that part.

I think the problem is at the Gather node: the number of buffers (read +
hit) is greater than the number of pages in the relation. The reason is
that in the workers (ParallelQueryMain()), buffer usage accumulation
starts before ExecutorStart(), whereas the master backend always accounts
for it only during the ExecutorRun() phase. On changing the workers to
accumulate only for the ExecutorRun() phase, the above problem is fixed.
The attached patch fixes the problem.

Why is it a bad thing to capture the cost of doing ExecutorStart() in
the worker? I can see there's an argument that changing this would be
more consistent, but I'm not totally convinced. The overhead of
ExecutorStart() in the leader isn't attributable to any specific
worker, but the overhead of ExecutorStart() in the worker can fairly
be blamed on Gather, I think. I'm not dead set against this change
but I'm not totally convinced, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#11)
Re: parallel joins, and better parallel explain

On Wed, Dec 9, 2015 at 11:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 4, 2015 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the problem is at the Gather node: the number of buffers (read +
hit) is greater than the number of pages in the relation. The reason is
that in the workers (ParallelQueryMain()), buffer usage accumulation
starts before ExecutorStart(), whereas the master backend always accounts
for it only during the ExecutorRun() phase. On changing the workers to
accumulate only for the ExecutorRun() phase, the above problem is fixed.
The attached patch fixes the problem.

Why is it a bad thing to capture the cost of doing ExecutorStart() in
the worker? I can see there's an argument that changing this would be
more consistent, but I'm not totally convinced. The overhead of
ExecutorStart() in the leader isn't attributable to any specific
worker, but the overhead of ExecutorStart() in the worker can fairly
be blamed on Gather, I think.

This boils down to the question of why, currently, buffer usage and other
similar stats (time for node execution) for a node don't include the cost
of ExecutorStart(). I think the reason is that ExecutorStart() does other
miscellaneous work, like accessing system tables or initializing nodes
that we might not even execute, so accumulating the cost of such work
doesn't seem meaningful. Looking at the references to InstrStartNode(),
which actually marks the beginning of buffer usage stats, it is clear
that buffer usage stats are not counted for the ExecutorStart() phase, so
following the same convention for worker stats seems more accurate.
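
For reference, the bracketing in question looks roughly like this
(abridged; the dispatch call shown is a hypothetical stand-in for the
big switch on nodeTag(node) in execProcnode.c):

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    TupleTableSlot *result;

    /* Instrumentation, including buffer usage, starts counting here... */
    if (node->instrument)
        InstrStartNode(node->instrument);

    result = ExecDispatchByNodeTag(node);   /* hypothetical stand-in */

    /* ...and stops here, so only the run phase is ever counted. */
    if (node->instrument)
        InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}

ExecutorStart() runs long before any of this, which is why its costs
never show up in the per-node stats.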

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#8)
Re: parallel joins, and better parallel explain

On Thu, Dec 3, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 1, 2015 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

It would be better if we can split this patch into multiple patches like
Explain related changes, Append pushdown related changes, Join
Push down related changes. You can choose to push the patches as
you prefer, but splitting can certainly help in review/verification of
the code.

I don't think it really makes sense to split the append push-down
changes from the join push-down changes; those share a great deal of
code.

Not an issue. I have started looking into the parallel join patch, and
below are a few findings:

1.
There are a few compilation errors in the patch. It seems the patch needs
to adapt to the latest changes done in commit edca44b1.

1>src/backend/optimizer/path/joinpath.c(420): error C2039: 'extra_lateral_rels' : is not a member of 'JoinPathExtraData'
1>E:\WorkSpace\PostgreSQL\master\postgresql\src\include\nodes/relation.h(1727) : see declaration of 'JoinPathExtraData'
..
..

2.
Why doesn't consider_parallel_nestloop() consider materializing the inner
relation, as we do in match_unsorted_outer()?
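
For reference, the non-parallel join search does roughly this in
match_unsorted_outer() (abridged, with the surrounding joinpath.c context
assumed); presumably something similar could be done for the partial
nestloop paths:

    /*
     * Consider materializing the cheapest inner path, unless it is one
     * that materializes its output anyway.
     */
    if (enable_material && inner_cheapest_total != NULL &&
        !ExecMaterializesOutput(inner_cheapest_total->pathtype))
        matpath = (Path *)
            create_material_path(innerrel, inner_cheapest_total);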

I have generated a test, as below, where the non-parallel Nestloop join
is faster than the parallel Nestloop join. I am using 'hydra' for testing
this patch.

CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 2000000) g;

Analyze t1;
Analyze t2;

Restart Server
Connect with psql

set enable_hashjoin=off;
set enable_mergejoin=off;
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
AND t1.c1 BETWEEN 100000 AND 100100;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=3294864.21..3294864.21 rows=1 width=0) (actual time=42614.102..42614.102 rows=1 loops=1)
   -> Nested Loop (cost=0.00..3294864.16 rows=20 width=0) (actual time=4123.463..42614.084 rows=101 loops=1)
        Join Filter: (t1.c1 = t2.c1)
        Rows Removed by Join Filter: 201999899
        -> Seq Scan on t2 (cost=0.00..30811.00 rows=2000000 width=4) (actual time=0.027..284.979 rows=2000000 loops=1)
        -> Materialize (cost=0.00..204053.41 rows=102 width=4) (actual time=0.000..0.008 rows=101 loops=2000000)
             -> Seq Scan on t1 (cost=0.00..204052.90 rows=102 width=4) (actual time=13.920..2024.684 rows=101 loops=1)
                  Filter: ((c1 >= 100000) AND (c1 <= 100100))
                  Rows Removed by Filter: 9999899
 Planning time: 0.085 ms
 Execution time: 42614.135 ms

I have repeated the above statement 3 times, and the above result is the
median of the 3 runs.

Restart Server
Connect with psql

set enable_hashjoin=off;
set enable_mergejoin=off;

set max_parallel_degree=4;

Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1
BETWEEN 100000 AND 100100;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1311396.47..1311396.48 rows=1 width=0) (actual time=45736.973..45736.973 rows=1 loops=1)
   -> Gather (cost=1000.00..1311396.42 rows=20 width=0) (actual time=709.083..45736.925 rows=101 loops=1)
        Number of Workers: 4
        -> Nested Loop (cost=0.00..1310394.42 rows=20 width=0) (actual time=436.460..11240.321 rows=20 loops=5)
             Join Filter: (t1.c1 = t2.c1)
             Rows Removed by Join Filter: 40399980
             -> Parallel Seq Scan on t1 (cost=0.00..45345.09 rows=23 width=4) (actual time=425.178..425.232 rows=20 loops=5)
                  Filter: ((c1 >= 100000) AND (c1 <= 100100))
                  Rows Removed by Filter: 1999980
             -> Seq Scan on t2 (cost=0.00..30811.00 rows=2000000 width=4) (actual time=0.011..270.986 rows=2000000 loops=101)
 Planning time: 0.115 ms
 Execution time: 45737.863 ms

I have repeated the above statement 3 times, and the above result is the
median of the 3 runs.

Now, the point to observe here is that the non-parallel case uses less
execution time as well as less planning time to complete the statement.
With the parallel plan there is a considerable increase in planning time
without any benefit in execution.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Robert Haas (#8)
Re: parallel joins, and better parallel explain

On Wed, Dec 2, 2015 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Oops. The new version I've attached should fix this.

I've been trying to see if parallel join has any effect on PostGIS
spatial join queries, which are commonly CPU bound. (My tests [1] on
simple parallel scan were very positive, though quite limited in that
they only parallelized such a small part of the work.)

Like Amit, I found the current patches are missing a change to
src/include/nodes/relation.h, but just adding in "Relids
extra_lateral_rels" to JoinPathExtraData allowed a warning-free build.

The assumptions in the parallel code generally are that setting up
parallel workers is very costly compared to the work to be done, so to
get PostGIS code to parallelize I've been reduced to shoving
parallel_setup_cost very low (1) and parallel_tuple_cost as well.
Otherwise I just end up with ordinary plans. I did redefine all the
relevant functions as "parallel safe" and upped their declared costs
significantly.

I set up a 8000 record spatial table, with a spatial index, and did a
self-join on it.

explain analyze
select a.gid, b.gid from vada a join vada b
on st_intersects(a.geom, b.geom)
where a.gid != b.gid;

With no parallelism, I got this:

set max_parallel_degree = 0;

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop (cost=0.15..227332.48 rows=1822243 width=8) (actual time=0.377..5528.461 rows=52074 loops=1)
   -> Seq Scan on vada a (cost=0.00..1209.92 rows=8792 width=1189) (actual time=0.027..5.004 rows=8792 loops=1)
   -> Index Scan using vada_gix on vada b (cost=0.15..25.71 rows=1 width=1189) (actual time=0.351..0.622 rows=6 loops=8792)
        Index Cond: (a.geom && geom)
        Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
        Rows Removed by Filter: 3
 Planning time: 3.976 ms
 Execution time: 5533.573 ms

With parallelism, I got this:

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop (cost=0.15..226930.05 rows=1822243 width=8) (actual time=0.840..5462.029 rows=52074 loops=1)
   -> Gather (cost=0.00..807.49 rows=8792 width=1189) (actual time=0.335..39.326 rows=8792 loops=1)
        Number of Workers: 1
        -> Parallel Seq Scan on vada a (cost=0.00..806.61 rows=5861 width=1189) (actual time=0.015..10.167 rows=4396 loops=2)
   -> Index Scan using vada_gix on vada b (cost=0.15..25.71 rows=1 width=1189) (actual time=0.353..0.609 rows=6 loops=8792)
        Index Cond: (a.geom && geom)
        Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
        Rows Removed by Filter: 3
 Planning time: 4.019 ms
 Execution time: 5467.126 ms

Given that it's a CPU-bound process, I was hoping for something closer
to the results of the sequence tests, about 50% time reduction, based
on the two cores in my test machine.

In general, either the parallel planner is way too conservative (it
seems), or we need to radically increase the costs of our PostGIS
functions (right now, most "costly" functions are cost 100, but I had
to push costs up into the 100000 range to get parallelism to kick in
sometimes). Some guidelines on cost setting would be useful, something
that says, "this function run against this kind of data is cost level 1;
compare the time your function takes on 'standard' data to the
baseline function to get a cost ratio to use in the function
definition."
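
As a rough sketch of what such a guideline could look like (the numbers
here are invented purely for illustration): if a baseline cost-1 function
such as a simple equality comparison evaluates in roughly 0.1
microseconds on "standard" data, and a spatial predicate takes 1-10
milliseconds on representative geometries, the time ratio suggests
declaring the predicate with a cost on the order of 10000-100000, which
lines up with the range I had to use above to get parallelism to kick in.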

ATB,

P.

[1]: https://gist.github.com/pramsey/84e7a39d83cccae692f8

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#13)
Re: parallel joins, and better parallel explain

On Mon, Dec 14, 2015 at 8:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

set enable_hashjoin=off;
set enable_mergejoin=off;

[ ... ]

Now, the point to observe here is that the non-parallel case uses less
execution time as well as less planning time to complete the statement.
With the parallel plan there is a considerable increase in planning time
without any benefit in execution.

So, you forced the query planner to give you a bad plan, and then
you're complaining that the plan is bad? That's not a very surprising
result.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#15)
Re: parallel joins, and better parallel explain

On Tue, Dec 15, 2015 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 14, 2015 at 8:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

set enable_hashjoin=off;
set enable_mergejoin=off;

[ ... ]

Now, the point to observe here is that the non-parallel case uses less
execution time as well as less planning time to complete the statement.
With the parallel plan there is a considerable increase in planning time
without any benefit in execution.

So, you forced the query planner to give you a bad plan, and then
you're complaining that the plan is bad?

Oh no, I wanted to check the behaviour of parallel vs. non-parallel
execution of joins. I think even if hash and merge join are set to
off, it should have picked the non-parallel NestLoop plan. In any case,
I have done some more investigation of the patch and found that even
without changing query-planner-related parameters, it seems to give
bad plans (as in the example below [1]). I think the costing of the
rework each worker has to do seems to be missing, which is causing bad
plans to be selected over good plans. Another point is that with the
patch, the number of paths that we explore to get the cheapest path has
increased; do you think we should try to evaluate that? One way is to
run some queries with a larger number of joins and see the impact on
planner time; another is to calculate the increase in the number of
paths the planner explores.

[1]:
CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 3000000) g;

Analyze t1;
Analyze t2;

Non-parallel case

postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
AND t1.c1 BETWEEN 100 AND 200;
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=261519.93..261519.94 rows=1 width=0) (actual time=2779.965..2779.965 rows=1 loops=1)
   -> Hash Join (cost=204052.91..261519.92 rows=1 width=0) (actual time=2017.241..2779.947 rows=101 loops=1)
        Hash Cond: (t2.c1 = t1.c1)
        -> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.073..393.479 rows=3000000 loops=1)
        -> Hash (cost=204052.90..204052.90 rows=1 width=4) (actual time=2017.130..2017.130 rows=101 loops=1)
             Buckets: 1024 Batches: 1 Memory Usage: 12kB
             -> Seq Scan on t1 (cost=0.00..204052.90 rows=1 width=4) (actual time=0.038..2017.105 rows=101 loops=1)
                  Filter: ((c1 >= 100) AND (c1 <= 200))
                  Rows Removed by Filter: 9999899
 Planning time: 0.113 ms
 Execution time: 2780.000 ms
(11 rows)

Parallel-case
set max_parallel_degree=4;

postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
AND t1.c1 BETWEEN 100 AND 200;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=100895.52..100895.53 rows=1 width=0) (actual time=67871.443..67871.443 rows=1 loops=1)
   -> Gather (cost=1000.00..100895.52 rows=1 width=0) (actual time=0.653..67871.287 rows=101 loops=1)
        Number of Workers: 4
        -> Nested Loop (cost=0.00..99895.42 rows=1 width=0) (actual time=591.408..16455.731 rows=20 loops=5)
             Join Filter: (t1.c1 = t2.c1)
             Rows Removed by Join Filter: 60599980
             -> Parallel Seq Scan on t1 (cost=0.00..45345.09 rows=0 width=4) (actual time=433.350..433.386 rows=20 loops=5)
                  Filter: ((c1 >= 100) AND (c1 <= 200))
                  Rows Removed by Filter: 1999980
             -> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.005..395.480 rows=3000000 loops=101)
 Planning time: 0.114 ms
 Execution time: 67871.584 ms
(12 rows)

Without patch, parallel case

set max_parallel_degree=4;

Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1
BETWEEN 100 AND 200;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=103812.21..103812.22 rows=1 width=0) (actual time=1207.043..1207.043 rows=1 loops=1)
   -> Hash Join (cost=46345.20..103812.21 rows=1 width=0) (actual time=428.632..1207.027 rows=101 loops=1)
        Hash Cond: (t2.c1 = t1.c1)
        -> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.034..375.670 rows=3000000 loops=1)
        -> Hash (cost=46345.19..46345.19 rows=1 width=4) (actual time=428.557..428.557 rows=101 loops=1)
             Buckets: 1024 Batches: 1 Memory Usage: 13kB
             -> Gather (cost=1000.00..46345.19 rows=1 width=4) (actual time=0.287..428.476 rows=101 loops=1)
                  Number of Workers: 4
                  -> Parallel Seq Scan on t1 (cost=0.00..45345.09 rows=1 width=4) (actual time=340.139..425.591 rows=20 loops=5)
                       Filter: ((c1 >= 100) AND (c1 <= 200))
                       Rows Removed by Filter: 1999980
 Planning time: 0.116 ms
 Execution time: 1207.196 ms
(13 rows)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#16)
12 attachment(s)
Re: parallel joins, and better parallel explain

On Wed, Dec 16, 2015 at 6:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In any case, I have done some more investigation of the patch and found
that even without changing query-planner-related parameters, it seems to
give bad plans (as in the example below [1]). I think the costing of the
rework each worker has to do seems to be missing
I have done some more testing using the TPC-H benchmark (for some of the
queries, especially for Parallel Hash Join), and the results summary is
as below.

Planning Time (ms)
Query      Base     Patch
TPC-H Q2   2.2      2.4
TPC-H Q3   0.67     0.71
TPC-H Q5   3.17     2.3
TPC-H Q7   2.43     2.4

Execution Time (ms)
Query      Base     Patch
TPC-H Q2   2826     766
TPC-H Q3   23473    24271
TPC-H Q5   21357    1432
TPC-H Q7   6779     1138
All test files and detailed plan output are attached to this mail:
q2.sql, q3.sql, q5.sql and q7.sql are the TPC-H benchmark's 2nd, 3rd,
5th and 7th queries, and the results with base and parallel join are in
q*_base.out and q*_parallel.out respectively.

Summary: With the TPC-H queries, wherever a Hash Join is pushed under
the Gather node, a significant improvement is visible; with Q2, using 3
workers, the time consumed is almost 1/3 of the base.

I observed one problem: with Q5 and Q7, some relation and snapshot
references are leaked and I am getting the below warnings; I haven't yet
looked into the issue.

WARNING: relcache reference leak: relation "customer" not closed
WARNING: relcache reference leak: relation "customer" not closed
WARNING: relcache reference leak: relation "customer" not closed
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
WARNING: relcache reference leak: relation "customer" not closed
CONTEXT: parallel worker, PID 123413
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
CONTEXT: parallel worker, PID 123413
WARNING: relcache reference leak: relation "customer" not closed
CONTEXT: parallel worker, PID 123412
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
CONTEXT: parallel worker, PID 123412
WARNING: relcache reference leak: relation "customer" not closed
CONTEXT: parallel worker, PID 123411
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
CONTEXT: parallel worker, PID 123411
psql:q7.sql:40: WARNING: relcache reference leak: relation "customer" not
closed
psql:q7.sql:40: WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still
referenced

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachments:

q2_base.outapplication/octet-stream; name=q2_base.outDownload
q2_parallel.outapplication/octet-stream; name=q2_parallel.outDownload
q3_base.outapplication/octet-stream; name=q3_base.outDownload
q3_parallel.outapplication/octet-stream; name=q3_parallel.outDownload
q5_base.outapplication/octet-stream; name=q5_base.outDownload
q5_parallel.outapplication/octet-stream; name=q5_parallel.outDownload
q7_base.outapplication/octet-stream; name=q7_base.outDownload
q7_parallel.outapplication/octet-stream; name=q7_parallel.outDownload
q3.sqltext/x-sql; charset=US-ASCII; name=q3.sqlDownload
q5.sqltext/x-sql; charset=US-ASCII; name=q5.sqlDownload
q7.sqltext/x-sql; charset=US-ASCII; name=q7.sqlDownload
q2.sqltext/x-sql; charset=US-ASCII; name=q2.sqlDownload
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#17)
Re: parallel joins, and better parallel explain

On Wed, Dec 16, 2015 at 9:55 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have done some more testing using the TPC-H benchmark (for some of the
queries, especially for Parallel Hash Join), and the results summary is
as below.

Planning Time (ms)
Query      Base     Patch
TPC-H Q2   2.2      2.4
TPC-H Q3   0.67     0.71
TPC-H Q5   3.17     2.3
TPC-H Q7   2.43     2.4

Execution Time (ms)
Query      Base     Patch
TPC-H Q2   2826     766
TPC-H Q3   23473    24271
TPC-H Q5   21357    1432
TPC-H Q7   6779     1138

All test files and detailed plan output are attached to that mail, and
the results with base and parallel join are in q*_base.out and
q*_parallel.out respectively.

Summary: With the TPC-H queries, wherever a Hash Join is pushed under
the Gather node, a significant improvement is visible; with Q2, using 3
workers, the time consumed is almost 1/3 of the base.

I observed one problem: with Q5 and Q7, some relation and snapshot
references are leaked and I am getting the below warnings; I haven't yet
looked into the issue.

While looking at the plans for Q5 and Q7, I have observed that a Gather
node is pushed below another Gather node, a case we don't have an
appropriate way of dealing with. I think that could be the reason why
you are seeing the errors.

Also, I think it would be good if you could check the plan/execution
time with max_parallel_degree=0 once, as that can give us base reference
data without parallelism. I am also wondering whether you have changed
any other parallel-cost-related parameters?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#18)
6 attachment(s)
Re: parallel joins, and better parallel explain

On Wed, Dec 17, 2015 at 11:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

While looking at the plans for Q5 and Q7, I have observed that a Gather
node is pushed below another Gather node, a case we don't have an
appropriate way of dealing with. I think that could be the reason why
you are seeing the errors.

Ok

Also, I think it would be good if you could check the plan/execution
time with max_parallel_degree=0 once, as that can give us base reference
data without parallelism. I am also wondering whether you have changed
any other parallel-cost-related parameters?

Oops. Earlier I had changed the parallel_tuple_cost parameter to 0.01;
now I have changed it back to the default value 0.1 and measured
performance again, with max_parallel_degree=0 and max_parallel_degree=4.

Note: Last time I used scale factor 1 for generating the TPC-H data
(./dbgen -v -s 1), but after using the default value of
parallel_tuple_cost it was not selecting a parallel join, so I have
taken the results with scale factor 5 (./dbgen -v -s 5).

Below are the latest performance data.

1. TPC-H Q2:
max_parallel_degree=0
Planning time: 2.321 ms
Execution time: 829.817 ms

max_parallel_degree=4
Planning time: 2.530 ms
Execution time: 803.428 ms
2. TPC-H Q5:
max_parallel_degree=0
Planning time: 1.938 ms
Execution time: 1062.419 ms

max_parallel_degree=4
Planning time: 2.950 ms
Execution time: 487.461 ms

3. TPC-H Q7:
max_parallel_degree=0
Planning time: 2.515 ms
Execution time: 1651.763 ms

max_parallel_degree=4
Planning time: 2.379 ms
Execution time: 2107.863 ms

Plans for max_parallel_degree=0 and max_parallel_degree=4 are attached to
the mail with file names q*_base.out and q*_parallel.out respectively.

For Q3, it is not selecting a parallel plan.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachments:

q2_base.outapplication/octet-stream; name=q2_base.outDownload
q2_parallel.outapplication/octet-stream; name=q2_parallel.outDownload
q5_base.outapplication/octet-stream; name=q5_base.outDownload
q5_parallel.outapplication/octet-stream; name=q5_parallel.outDownload
q7_base.outapplication/octet-stream; name=q7_base.outDownload
q7_parallel.outapplication/octet-stream; name=q7_parallel.outDownload
#20Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#18)
Re: parallel joins, and better parallel explain

On Thu, Dec 17, 2015 at 12:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

While looking at the plans for Q5 and Q7, I have observed that a Gather
node is pushed below another Gather node, a case we don't have an
appropriate way of dealing with. I think that could be the reason why
you are seeing the errors.

Uh oh. That's not supposed to happen. A GatherPath is supposed to
have parallel_safe = false, which should prevent the planner from
using it to form new partial paths. Is this with the latest version
of the patch? The plan output suggests that we're somehow reaching
try_partial_hashjoin_path() with inner_path being a GatherPath, but I
don't immediately see how that's possible, because
create_gather_path() sets parallel_safe to false unconditionally, and
hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
that path is parallel_safe.

Do you have a self-contained test case that reproduces this, or any
insight as to how it's happening here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#21Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#20)
1 attachment(s)
Re: parallel joins, and better parallel explain

On Fri, Dec 18, 2015 at 7:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Uh oh. That's not supposed to happen. A GatherPath is supposed to
have parallel_safe = false, which should prevent the planner from
using it to form new partial paths. Is this with the latest version
of the patch? The plan output suggests that we're somehow reaching
try_partial_hashjoin_path() with inner_path being a GatherPath, but I
don't immediately see how that's possible, because
create_gather_path() sets parallel_safe to false unconditionally, and
hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
that path is parallel_safe.

Yes, you are right that create_gather_path() sets parallel_safe to false
unconditionally, but whenever we are building a non-partial path, we
should carry the parallel_safe state forward from its children to the
parent, and it seems that part is missing here.

I have done a quick hack in create_nestloop_path, and the error is gone
with this change:

create_nestloop_path()
{
	...
	pathnode->path.param_info =
		get_joinrel_parampathinfo(root,
								  joinrel,
								  outer_path,
								  inner_path,
								  sjinfo,
								  required_outer,
								  &restrict_clauses);
	pathnode->path.parallel_aware = false;

	/*
	 * Taking this from joinrel->consider_parallel is wrong: the joinrel
	 * may be parallel-safe even though this particular pair of input
	 * paths is not, so derive the flag from the child paths instead.
	 * If either child path is parallel-unsafe, mark the parent
	 * parallel-unsafe too.
	 */
	pathnode->path.parallel_safe = (inner_path->parallel_safe &&
									outer_path->parallel_safe);
	...
}

The new plan is attached to the mail.

With this change, the execution time has also been cut roughly in half
(maybe because the warning messages are gone).

Do you have a self-contained test case that reproduces this, or any
insight as to how it's happening here?

This is the TPC-H benchmark case; we can set it up like this:
1. git clone https://tkejser@bitbucket.org/tkejser/tpch-dbgen.git
2. compile using make
3. ./dbgen -v -s 5
4. ./qgen
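
To load the generated data, something like the following should work (a
sketch only: dss.ddl is the table-definition file shipped with dbgen, and
dbgen's .tbl files end each line with a trailing '|' that COPY rejects,
so strip it first, e.g. with sed -i 's/|$//' *.tbl):

\i dss.ddl                         -- create the eight TPC-H tables
COPY lineitem FROM '/path/to/lineitem.tbl' WITH (DELIMITER '|');
-- ...repeat the COPY for the other seven tables...
ANALYZE;                           -- give the planner fresh statistics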

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachments:

q7_parallel_new.out
#22Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#21)
3 attachment(s)
Re: parallel joins, and better parallel explain

On Fri, Dec 18, 2015 at 3:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 18, 2015 at 7:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Uh oh. That's not supposed to happen. A GatherPath is supposed to
have parallel_safe = false, which should prevent the planner from
using it to form new partial paths. Is this with the latest version
of the patch? The plan output suggests that we're somehow reaching
try_partial_hashjoin_path() with inner_path being a GatherPath, but I
don't immediately see how that's possible, because
create_gather_path() sets parallel_safe to false unconditionally, and
hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
that path is parallel_safe.

Yes, you are right that create_gather_path() sets parallel_safe to false
unconditionally, but whenever we are building a non-partial path, we
should carry the parallel_safe state forward from its children to the
parent, and it seems that part is missing here.

Ah, right. Woops. I can't exactly replicate your results, but I've
attempted to fix this in a systematic way in the new version attached
here (parallel-join-v3.patch).

Do you have a self-contained test case that reproduces this, or any
insight as to how it's happening here?

This is the TPC-H benchmark case; we can set it up like this:
1. git clone https://tkejser@bitbucket.org/tkejser/tpch-dbgen.git
2. compile using make
3. ./dbgen -v -s 5
4. ./qgen

Thanks. After a bit of fiddling I was able to get this to work. I'm
attaching two other patches that seem to help this case quite
considerably. The first (parallel-reader-order-v1) causes Gather to
read from the same worker repeatedly until it can't get another tuple
from that worker without blocking, and only then move on to the next
worker. With 4 workers, this seems to be drastically more efficient
than what's currently in master - I saw the time for Q5 drop from over
17 seconds to about 6 (this was an assert-enabled build running with
EXPLAIN ANALYZE, though, so take those numbers with a grain of salt).
The second (gather-disuse-physical-tlist.patch) causes Gather to force
underlying scan nodes to project, which is a good idea here for
reasons very similar to why it's a good idea for the existing node
types that use disuse_physical_tlist: forcing extra data through the
Gather node is bad. That shaved another half second off this query.

The exact query I was using for testing was:

explain (analyze, verbose)
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey
  and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'EUROPE'
  and o_orderdate >= date '1995-01-01'
  and o_orderdate < date '1995-01-01' + interval '1' year
group by n_name
order by revenue desc;
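
As a quick way to see the effect of gather-disuse-physical-tlist, EXPLAIN
(VERBOSE) on something like the sketch below (any wide table will do;
lineitem is just convenient) should show only the referenced columns in
the parallel scan's Output list, rather than all sixteen lineitem columns
being forced through the Gather node:

explain (verbose)
select sum(l_extendedprice) from lineitem where l_discount > 0.05;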

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-join-v3.patch
From 088a96363231b441fe6aab744b04a522d01fbc17 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 19 Nov 2015 20:28:34 -0500
Subject: [PATCH 1/3] Gather pushdown for child tables, hash joins, nested
 loops.

Cost model fix for parallel seq scan.

Fixed this so it propagates the parallel_safe flag up the plan tree.
---
 src/backend/executor/execParallel.c     |  66 +++---
 src/backend/nodes/outfuncs.c            |   4 +-
 src/backend/optimizer/README            |  55 ++++-
 src/backend/optimizer/path/allpaths.c   | 164 +++++++++++----
 src/backend/optimizer/path/costsize.c   |  32 +--
 src/backend/optimizer/path/joinpath.c   | 253 +++++++++++++++++++++-
 src/backend/optimizer/path/joinrels.c   |   3 +-
 src/backend/optimizer/plan/createplan.c |   2 +-
 src/backend/optimizer/plan/planmain.c   |   3 +-
 src/backend/optimizer/util/pathnode.c   | 361 +++++++++++++++++++++++++++++---
 src/backend/optimizer/util/relnode.c    |   2 +
 src/include/nodes/relation.h            |   4 +-
 src/include/optimizer/cost.h            |   2 +-
 src/include/optimizer/pathnode.h        |  12 +-
 src/include/optimizer/paths.h           |   2 +
 15 files changed, 845 insertions(+), 120 deletions(-)

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 30e6b3d..5bc8eef 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -167,25 +167,25 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	e->nnodes++;
 
 	/* Call estimators for parallel-aware nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanEstimate((SeqScanState *) planstate,
-								e->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanEstimate((SeqScanState *) planstate,
+									e->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
 
 /*
- * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
- * may need to initialize shared state in the DSM before parallel workers
- * are available.  They can allocate the space they previous estimated using
- * shm_toc_allocate, and add the keys they previously estimated using
- * shm_toc_insert, in each case targeting pcxt->toc.
+ * Initialize the dynamic shared memory segment that will be used to control
+ * parallel execution.
  */
 static bool
 ExecParallelInitializeDSM(PlanState *planstate,
@@ -202,15 +202,26 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	/*
+	 * Call initializers for parallel-aware plan nodes.
+	 *
+	 * Ordinary plan nodes won't do anything here, but parallel-aware plan
+	 * nodes may need to initialize shared state in the DSM before parallel
+	 * workers are available.  They can allocate the space they previously
+	 * estimated using shm_toc_allocate, and add the keys they previously
+	 * estimated using shm_toc_insert, in each case targeting pcxt->toc.
+	 */
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeDSM((SeqScanState *) planstate,
-									 d->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
+										 d->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
@@ -623,13 +634,16 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 		return false;
 
 	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 63fae82..6658a76 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1588,6 +1588,8 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_BOOL_FIELD(parallel_safe);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1765,7 +1767,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
 
@@ -1887,6 +1888,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 916a518..5019804 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)
 
 
--- bjm & tgl
+Parallel Query and Partial Paths
+--------------------------------
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  Academic literature on parallel query suggests that
+parallel execution strategies can be divided into essentially two
+categories: pipelined parallelism, where the execution of the query is
+divided into multiple stages and each stage is handled by a separate
+process; and partitioning parallelism, where the data is split between
+multiple processes and each process handles a subset of it.  The
+literature, however, suggests that gains from pipeline parallelism are
+often very limited due to the difficulty of avoiding pipeline stalls.
+Consequently, we do not currently attempt to generate query plans that
+use this technique.
+
+Instead, we focus on partitioning parallelism, which does not require
+that the underlying table be partitioned.  It only requires that (1)
+there is some method of dividing the data from at least one of the base
+tables involved in the relation across multiple processes, (2) allowing
+each process to handle its own portion of the data, and then (3)
+collecting the results.  Requirements (2) and (3) are satisfied by the
+executor node Gather, which launches any number of worker processes and
+executes its single child plan in all of them (and perhaps in the leader
+also, if the children aren't generating enough data to keep the leader
+busy).  Requirement (1) is handled by the SeqScan node: when invoked
+with parallel_aware = true, this node will, in effect, partition the
+table on a block by block basis, returning a subset of the tuples from
+the relation in each worker where that SeqScan is executed.  A similar
+scheme could be (and probably should be) implemented for bitmap heap
+scans.
+
+Just as we do for non-parallel access methods, we build Paths to
+represent access strategies that can be used in a parallel plan.  These
+are, in essence, the same strategies that are available in the
+non-parallel plan, but there is an important difference: a path that
+will run beneath a Gather node returns only a subset of the query
+results in each worker, not all of them.  To form a path that can
+actually be executed, the (rather large) cost of the Gather node must be
+accounted for.  For this reason among others, paths intended to run
+beneath a Gather node - which we call "partial" paths since they return
+only a subset of the results in each worker - must be kept separate from
+ordinary paths (see RelOptInfo's partial_pathlist and the function
+add_partial_path).
+
+One of the keys to making parallel query effective is to run as much of
+the query in parallel as possible.  Therefore, we expect it to generally
+be desirable to postpone the Gather stage until as near to the top of the
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costing them accurately) is likely to keep us
+busy for a long time to come.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 4516cd3..326de0c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -72,6 +72,7 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 				 Index rti, RangeTblEntry *rte);
 static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
 				   RangeTblEntry *rte);
+static void create_parallel_paths(PlannerInfo *root, RelOptInfo *rel);
 static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
 						  RangeTblEntry *rte);
 static bool function_rte_parallel_ok(RangeTblEntry *rte);
@@ -612,7 +613,6 @@ static void
 set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 {
 	Relids		required_outer;
-	int			parallel_threshold = 1000;
 
 	/*
 	 * We don't support pushing join clauses into the quals of a seqscan, but
@@ -624,39 +624,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer, 0));
 
-	/* Consider parallel sequential scan */
-	if (rel->consider_parallel && rel->pages > parallel_threshold &&
-		required_outer == NULL)
-	{
-		Path *path;
-		int parallel_degree = 1;
-
-		/*
-		 * Limit the degree of parallelism logarithmically based on the size
-		 * of the relation.  This probably needs to be a good deal more
-		 * sophisticated, but we need something here for now.
-		 */
-		while (rel->pages > parallel_threshold * 3 &&
-			   parallel_degree < max_parallel_degree)
-		{
-			parallel_degree++;
-			parallel_threshold *= 3;
-			if (parallel_threshold >= PG_INT32_MAX / 3)
-				break;
-		}
-
-		/*
-		 * Ideally we should consider postponing the gather operation until
-		 * much later, after we've pushed joins and so on atop the parallel
-		 * sequential scan path.  But we don't have the infrastructure for
-		 * that yet, so just do this for now.
-		 */
-		path = create_seqscan_path(root, rel, required_outer, parallel_degree);
-		path = (Path *)
-			create_gather_path(root, rel, path, required_outer,
-							   parallel_degree);
-		add_path(rel, path);
-	}
+	/* If appropriate, consider parallel sequential scan */
+	if (rel->consider_parallel && required_outer == NULL)
+		create_parallel_paths(root, rel);
 
 	/* Consider index scans */
 	create_index_paths(root, rel);
@@ -666,6 +636,54 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * create_parallel_paths
+ *	  Build parallel access paths for a plain relation
+ */
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int		parallel_threshold = 1000;
+	int		parallel_degree = 1;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * without doing anything ... unless it's an inheritance child.  In that case,
+	 * we want to generate a parallel path here anyway.  It might not be worthwhile
+	 * just for this relation, but when combined with all of its inheritance siblings
+	 * it may well pay off.
+	 */
+	if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+		return;
+
+	/*
+	 * Limit the degree of parallelism logarithmically based on the size of the
+	 * relation.  This probably needs to be a good deal more sophisticated, but we
+	 * need something here for now.
+	 */
+	while (rel->pages > parallel_threshold * 3 &&
+		   parallel_degree < max_parallel_degree)
+	{
+		parallel_degree++;
+		parallel_threshold *= 3;
+		if (parallel_threshold >= PG_INT32_MAX / 3)
+			break;
+	}
+
+	/* Add an unordered partial path based on a parallel sequential scan. */
+	add_partial_path(rel, create_seqscan_path(root, rel, NULL, parallel_degree));
+
+	/*
+	 * If this is a baserel, consider gathering any partial paths we may have
+	 * just created.  If we gathered an inheritance child, we could end up
+	 * with a very large number of gather nodes, each trying to grab its own
+	 * pool of workers, so don't do this in that case.  Instead, we'll
+	 * consider gathering partial paths for the appendrel.
+	 */
+	if (rel->reloptkind == RELOPT_BASEREL)
+		generate_gather_paths(root, rel);
+}
+
+/*
  * set_tablesample_rel_size
  *	  Set size estimates for a sampled relation
  */
@@ -1039,6 +1057,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	List	   *live_childrels = NIL;
 	List	   *subpaths = NIL;
 	bool		subpaths_valid = true;
+	List	   *partial_subpaths = NIL;
+	bool		partial_subpaths_valid = true;
 	List	   *all_child_pathkeys = NIL;
 	List	   *all_child_outers = NIL;
 	ListCell   *l;
@@ -1093,6 +1113,13 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		else
 			subpaths_valid = false;
 
+		/* Same idea, but for a partial plan. */
+		if (childrel->partial_pathlist != NIL)
+			partial_subpaths = accumulate_append_subpath(partial_subpaths,
+									   linitial(childrel->partial_pathlist));
+		else
+			partial_subpaths_valid = false;
+
 		/*
 		 * Collect lists of all the available path orderings and
 		 * parameterizations for all the children.  We use these as a
@@ -1164,7 +1191,39 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 * if we have zero or one live subpath due to constraint exclusion.)
 	 */
 	if (subpaths_valid)
-		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL));
+		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL, 0));
+
+	/*
+	 * Consider an append of unordered, unparameterized partial paths.
+	 */
+	if (partial_subpaths_valid)
+	{
+		AppendPath *appendpath;
+		ListCell   *lc;
+		int			parallel_degree = 0;
+
+		/*
+		 * Decide what parallel degree to request for this append path.  For
+		 * now, we just use the maximum parallel degree of any member.  It
+		 * might be useful to use a higher number if the Append node were
+		 * smart enough to spread out the workers, but it currently isn't.
+		 */
+		foreach(lc, partial_subpaths)
+		{
+			Path	   *path = lfirst(lc);
+
+			parallel_degree = Max(parallel_degree, path->parallel_degree);
+		}
+		Assert(parallel_degree > 0);
+
+		/* Generate a partial append path. */
+		appendpath = create_append_path(rel, partial_subpaths, NULL,
+										parallel_degree);
+		add_partial_path(rel, (Path *) appendpath);
+
+		/* Consider gathering it. */
+		generate_gather_paths(root, rel);
+	}
 
 	/*
 	 * Also build unparameterized MergeAppend paths based on the collected
@@ -1214,7 +1273,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 		if (subpaths_valid)
 			add_path(rel, (Path *)
-					 create_append_path(rel, subpaths, required_outer));
+					 create_append_path(rel, subpaths, required_outer, 0));
 	}
 }
 
@@ -1440,8 +1499,9 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 
 	/* Discard any pre-existing paths; no further need for them */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/*
 	 * We set the cheapest path immediately, to ensure that IS_DUMMY_REL()
@@ -1844,6 +1904,36 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * generate_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather on
+ *		top of a partial path.
+ */
+void
+generate_gather_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	Path	   *cheapest_partial_path;
+	Path	   *simple_gather_path;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/*
+	 * The output of Gather is currently always unsorted, so there's only one
+	 * partial path of interest: the cheapest one.
+	 *
+	 * Eventually, we should have a Gather Merge operation that can merge
+	 * multiple tuple streams together while preserving their ordering.  We
+	 * could usefully generate such a path from each partial path that has
+	 * non-NIL pathkeys.
+	 */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+	simple_gather_path = (Path *)
+		create_gather_path(root, rel, cheapest_partial_path, NULL);
+	add_path(rel, simple_gather_path);
+}
+
+/*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..1e76c03 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -186,24 +186,34 @@ clamp_row_est(double nrows)
  */
 void
 cost_seqscan(Path *path, PlannerInfo *root,
-			 RelOptInfo *baserel, ParamPathInfo *param_info,
-			 int nworkers)
+			 RelOptInfo *baserel, ParamPathInfo *param_info)
 {
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
 	double		spc_seq_page_cost;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
+	double		parallel_divisor = 1;
 
 	/* Should only be applied to base relations */
 	Assert(baserel->relid > 0);
 	Assert(baserel->rtekind == RTE_RELATION);
 
+	/*
+	 * Primitive parallel cost model.  Assume the leader will do half as much
+	 * work as a regular worker, because it will also need to read the tuples
+	 * returned by the workers when they percolate up to the gather node.
+	 * This is almost certainly not exactly the right way to model this, so
+	 * this will probably need to be changed at some point...
+	 */
+	if (path->parallel_degree > 0)
+		parallel_divisor = path->parallel_degree + 0.5;
+
 	/* Mark the path with the correct row estimate */
 	if (param_info)
-		path->rows = param_info->ppi_rows;
+		path->rows = param_info->ppi_rows / parallel_divisor;
 	else
-		path->rows = baserel->rows;
+		path->rows = baserel->rows / parallel_divisor;
 
 	if (!enable_seqscan)
 		startup_cost += disable_cost;
@@ -216,24 +226,14 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	/*
 	 * disk costs
 	 */
-	run_cost += spc_seq_page_cost * baserel->pages;
+	run_cost += spc_seq_page_cost * baserel->pages / parallel_divisor;
 
 	/* CPU costs */
 	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
 
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
-	run_cost += cpu_per_tuple * baserel->tuples;
-
-	/*
-	 * Primitive parallel cost model.  Assume the leader will do half as much
-	 * work as a regular worker, because it will also need to read the tuples
-	 * returned by the workers when they percolate up to the gather ndoe.
-	 * This is almost certainly not exactly the right way to model this, so
-	 * this will probably need to be changed at some point...
-	 */
-	if (nworkers > 0)
-		run_cost = run_cost / (nworkers + 0.5);
+	run_cost += cpu_per_tuple * baserel->tuples / parallel_divisor;
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 53d8fdd..33268d1 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -34,6 +34,12 @@ static void sort_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 static void match_unsorted_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
+static void consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra);
 static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
@@ -216,7 +222,12 @@ add_paths_to_joinrel(PlannerInfo *root,
 												 jointype, &extra);
 
 	/*
-	 * 6. Finally, give extensions a chance to manipulate the path list.
+	 * 6. Consider gathering partial paths.
+	 */
+	generate_gather_paths(root, joinrel);
+
+	/*
+	 * 7. Finally, give extensions a chance to manipulate the path list.
 	 */
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
@@ -330,6 +341,62 @@ try_nestloop_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_nestloop_path
+ *	  Consider a partial nestloop join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_nestloop_path(PlannerInfo *root,
+				  RelOptInfo *joinrel,
+				  Path *outer_path,
+				  Path *inner_path,
+				  List *pathkeys,
+				  JoinType jointype,
+				  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * If the inner path is parameterized, the parameterization must be fully
+	 * satisfied by the proposed outer path.  Parameterized partial paths are
+	 * not supported.  The caller should already have verified that no
+	 * extra_lateral_rels are required here.
+	 */
+	Assert(bms_is_empty(joinrel->lateral_relids));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_nestloop(root, &workspace, jointype,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_nestloop_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  pathkeys,
+								  NULL));
+}
+
+/*
  * try_mergejoin_path
  *	  Consider a merge join path; if it appears useful, push it into
  *	  the joinrel's pathlist via add_path().
@@ -472,6 +539,62 @@ try_hashjoin_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_hashjoin_path
+ *	  Consider a partial hashjoin join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_hashjoin_path(PlannerInfo *root,
+						  RelOptInfo *joinrel,
+						  Path *outer_path,
+						  Path *inner_path,
+						  List *hashclauses,
+						  JoinType jointype,
+						  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * If the inner path is parameterized, the parameterization must be fully
+	 * satisfied by the proposed outer path.  Parameterized partial paths are
+	 * not supported.  The caller should already have verified that no
+	 * extra_lateral_rels are required here.
+	 */
+	Assert(bms_is_empty(joinrel->lateral_relids));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_empty(inner_paramrels))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, NIL))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_hashjoin_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  NULL,
+								  hashclauses));
+}
+
+/*
  * clause_sides_match_join
  *	  Determine whether a join clause is of the right form to use in this join.
  *
@@ -1063,6 +1186,85 @@ match_unsorted_outer(PlannerInfo *root,
 				break;
 		}
 	}
+
+	/*
+	 * If the joinrel is parallel-safe and the join type supports nested loops,
+	 * we may be able to consider a partial nestloop plan.  However, we can't
+	 * handle JOIN_UNIQUE_OUTER, because the outer path will be partial, and
+	 * therefore we won't be able to properly guarantee uniqueness.  Nor can
+	 * we handle extra_lateral_rels, since partial paths must not be
+	 * parameterized.
+	 */
+	if (joinrel->consider_parallel && nestjoinOK &&
+		save_jointype != JOIN_UNIQUE_OUTER &&
+		bms_is_empty(joinrel->lateral_relids))
+		consider_parallel_nestloop(root, joinrel, outerrel, innerrel,
+								   save_jointype, extra);
+}
+
+/*
+ * consider_parallel_nestloop
+ *	  Try to build partial paths for a joinrel by joining a partial path for the
+ *	  outer relation to a complete path for the inner relation.
+ *
+ * 'joinrel' is the join relation
+ * 'outerrel' is the outer join relation
+ * 'innerrel' is the inner join relation
+ * 'jointype' is the type of join to do
+ * 'extra' contains additional input values
+ */
+static void
+consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra)
+{
+	ListCell   *lc1;
+
+	foreach(lc1, outerrel->partial_pathlist)
+	{
+		Path	   *outerpath = (Path *) lfirst(lc1);
+		List	   *pathkeys;
+		ListCell   *lc2;
+
+		/* Figure out what useful ordering any paths we create will have. */
+		pathkeys = build_join_pathkeys(root, joinrel, jointype,
+									   outerpath->pathkeys);
+
+		/*
+		 * Try the cheapest parameterized paths; only those which will
+		 * produce an unparameterized path when joined to this outerrel
+		 * will survive try_partial_nestloop_path.  The cheapest
+		 * unparameterized path is also in this list.
+		 */
+		foreach(lc2, innerrel->cheapest_parameterized_paths)
+		{
+			Path	   *innerpath = (Path *) lfirst(lc2);
+
+			/* Can't join to an inner path that is not parallel-safe */
+			if (!innerpath->parallel_safe)
+				continue;
+
+			/*
+			 * Like match_unsorted_outer, we only consider a single nestloop
+			 * path when the jointype is JOIN_UNIQUE_INNER.  But we have to scan
+			 * cheapest_parameterized_paths to find the one we want to consider,
+			 * because cheapest_total_path might not be parallel-safe.
+			 */
+			if (jointype == JOIN_UNIQUE_INNER)
+			{
+				if (!bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					continue;
+				innerpath = (Path *) create_unique_path(root, innerrel,
+											   innerpath, extra->sjinfo);
+			}
+
+			try_partial_nestloop_path(root, joinrel, outerpath, innerpath,
+									  pathkeys, jointype, extra);
+		}
+	}
 }
 
 /*
@@ -1240,6 +1442,55 @@ hash_inner_and_outer(PlannerInfo *root,
 				}
 			}
 		}
+
+		/*
+		 * If the joinrel is parallel-safe, we may be able to consider a
+		 * partial hash join.  However, we can't handle JOIN_UNIQUE_OUTER,
+		 * because the outer path will be partial, and therefore we won't be
+		 * able to properly guarantee uniqueness.  Also, the resulting path
+		 * must not be parameterized.
+		 */
+		if (joinrel->consider_parallel && jointype != JOIN_UNIQUE_OUTER &&
+			outerrel->partial_pathlist != NIL &&
+			bms_is_empty(joinrel->lateral_relids))
+		{
+			Path   *cheapest_partial_outer;
+			Path   *cheapest_safe_inner = NULL;
+
+			cheapest_partial_outer =
+				(Path *) linitial(outerrel->partial_pathlist);
+
+			/*
+			 * Normally, given that the joinrel is parallel-safe, the cheapest
+			 * total inner path will also be parallel-safe, but if not, we'll
+			 * have to search cheapest_parameterized_paths for the cheapest
+			 * unparameterized inner path.
+			 */
+			if (cheapest_total_inner->parallel_safe)
+				cheapest_safe_inner = cheapest_total_inner;
+			else
+			{
+				ListCell   *lc;
+
+				foreach(lc, innerrel->cheapest_parameterized_paths)
+				{
+					Path	   *innerpath = (Path *) lfirst(lc);
+
+					if (innerpath->parallel_safe &&
+						bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					{
+						cheapest_safe_inner = innerpath;
+						break;
+					}
+				}
+			}
+
+			if (cheapest_safe_inner != NULL)
+				try_partial_hashjoin_path(root, joinrel,
+										  cheapest_partial_outer,
+										  cheapest_safe_inner,
+										  hashclauses, jointype, extra);
+		}
 	}
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index ad58058..e32f445 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1194,9 +1194,10 @@ mark_dummy_rel(RelOptInfo *rel)
 
 	/* Evict any previously chosen paths */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/* Set or update cheapest_total_path and related fields */
 	set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 32f903d..94e5f26 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1125,7 +1125,7 @@ create_gather_plan(PlannerInfo *root, GatherPath *best_path)
 
 	gather_plan = make_gather(subplan->targetlist,
 							  NIL,
-							  best_path->num_workers,
+							  best_path->path.parallel_degree,
 							  best_path->single_copy,
 							  subplan);
 
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 894f968..a8ff2ff 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -84,7 +84,8 @@ query_planner(PlannerInfo *root, List *tlist,
 
 		/* The only path for it is a trivial Result path */
 		add_path(final_rel, (Path *)
-				 create_result_path((List *) parse->jointree->quals));
+				 create_result_path(final_rel,
+									(List *) parse->jointree->quals));
 
 		/* Select cheapest path (pretty easy in this case...) */
 		set_cheapest(final_rel);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index ec0910d..d1c8e00 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -217,7 +217,12 @@ compare_path_costs_fuzzily(Path *path1, Path *path2, double fuzz_factor)
  * The cheapest_parameterized_paths list collects all parameterized paths
  * that have survived the add_path() tournament for this relation.  (Since
  * add_path ignores pathkeys for a parameterized path, these will be paths
- * that have best cost or best row count for their parameterization.)
+ * that have best cost or best row count for their parameterization.  We
+ * may also have both a parallel-safe and a non-parallel-safe path for
+ * the same parameterization in some cases, but this should be
+ * relatively rare since, most typically, all paths for the same relation
+ * will be parallel-safe or none of them will.)
+ *
  * cheapest_parameterized_paths always includes the cheapest-total
  * unparameterized path, too, if there is one; the users of that list find
  * it more convenient if that's included.
@@ -352,11 +357,12 @@ set_cheapest(RelOptInfo *parent_rel)
  *	  A path is worthy if it has a better sort order (better pathkeys) or
  *	  cheaper cost (on either dimension), or generates fewer rows, than any
  *	  existing path that has the same or superset parameterization rels.
+ *	  We also consider parallel-safe paths more worthy than others.
  *
  *	  We also remove from the rel's pathlist any old paths that are dominated
  *	  by new_path --- that is, new_path is cheaper, at least as well ordered,
- *	  generates no more rows, and requires no outer rels not required by the
- *	  old path.
+ *	  generates no more rows, requires no outer rels not required by the old
+ *	  path, and is no less parallel-safe.
  *
  *	  In most cases, a path with a superset parameterization will generate
  *	  fewer rows (since it has more join clauses to apply), so that those two
@@ -470,14 +476,16 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						else if (keyscmp == PATHKEYS_BETTER2)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						else	/* keyscmp == PATHKEYS_EQUAL */
@@ -487,19 +495,25 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 								/*
 								 * Same pathkeys and outer rels, and fuzzily
 								 * the same cost, so keep just one; to decide
-								 * which, first check rows and then do a fuzzy
-								 * cost comparison with very small fuzz limit.
-								 * (We used to do an exact cost comparison,
-								 * but that results in annoying
-								 * platform-specific plan variations due to
-								 * roundoff in the cost estimates.)  If things
-								 * are still tied, arbitrarily keep only the
-								 * old path.  Notice that we will keep only
-								 * the old path even if the less-fuzzy
-								 * comparison decides the startup and total
-								 * costs compare differently.
+								 * which, first check parallel-safety, then
+								 * rows, then do a fuzzy cost comparison with
+								 * very small fuzz limit.  (We used to do an
+								 * exact cost comparison, but that results in
+								 * annoying platform-specific plan variations
+								 * due to roundoff in the cost estimates.)	If
+								 * things are still tied, arbitrarily keep
+								 * only the old path.  Notice that we will
+								 * keep only the old path even if the
+								 * less-fuzzy comparison decides the startup
+								 * and total costs compare differently.
 								 */
-								if (new_path->rows < old_path->rows)
+								if (new_path->parallel_safe >
+									old_path->parallel_safe)
+									remove_old = true;	/* new dominates old */
+								else if (new_path->parallel_safe <
+										 old_path->parallel_safe)
+									accept_new = false; /* old dominates new */
+								else if (new_path->rows < old_path->rows)
 									remove_old = true;	/* new dominates old */
 								else if (new_path->rows > old_path->rows)
 									accept_new = false; /* old dominates new */
@@ -512,10 +526,12 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 														 * dominates new */
 							}
 							else if (outercmp == BMS_SUBSET1 &&
-									 new_path->rows <= old_path->rows)
+									 new_path->rows <= old_path->rows &&
+									 new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 							else if (outercmp == BMS_SUBSET2 &&
-									 new_path->rows >= old_path->rows)
+									 new_path->rows >= old_path->rows &&
+									 new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 							/* else different parameterizations, keep both */
 						}
@@ -527,7 +543,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						break;
@@ -538,7 +555,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						break;
@@ -685,6 +703,214 @@ add_path_precheck(RelOptInfo *parent_rel,
 	return true;
 }
 
+/*
+ * add_partial_path
+ *	  Like add_path, our goal here is to consider whether a path is worthy
+ *	  of being kept around, but the considerations here are a bit different.
+ *	  A partial path is one which can be executed in any number of workers in
+ *	  parallel such that each worker will generate a subset of the path's
+ *	  overall result.
+ *
+ *	  We don't generate parameterized partial paths for several reasons.  Most
+ *	  importantly, they're not safe to execute, because there's nothing to
+ *	  make sure that a parallel scan within the parameterized portion of the
+ *	  plan is running with the same value in every worker at the same time.
+ *	  Fortunately, it seems unlikely to be worthwhile anyway, because having
+ *	  each worker scan the entire outer relation and a subset of the inner
+ *	  relation will generally be a terrible plan.  The inner (parameterized)
+ *	  side of the plan will be small anyway.  There could be rare cases where
+ *	  this wins big - e.g. if join order constraints put a 1-row relation on
+ *	  the outer side of the topmost join with a parameterized plan on the inner
+ *	  side - but we'll have to be content not to handle such cases until somebody
+ *	  builds an executor infrastructure that can cope with them.
+ *
+ *	  Because we don't consider parameterized paths here, we also don't
+ *	  need to consider the row counts as a measure of quality: every path will
+ *	  produce the same number of rows.  Neither do we need to consider startup
+ *	  costs: parallelism is only used for plans that will be run to completion.
+ *	  Therefore, this routine is much simpler than add_path: it needs to
+ *	  consider only pathkeys and total cost.
+ */
+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	bool		accept_new = true;		/* unless we find a superior old path */
+	ListCell   *insert_after = NULL;	/* where to insert new item */
+	ListCell   *p1;
+	ListCell   *p1_prev;
+	ListCell   *p1_next;
+
+	/* Check for query cancel. */
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * As in add_path, throw out any paths which are dominated by the new
+	 * path, but throw out the new path if some existing path dominates it.
+	 */
+	p1_prev = NULL;
+	for (p1 = list_head(parent_rel->partial_pathlist); p1 != NULL;
+		 p1 = p1_next)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		bool		remove_old = false; /* unless new proves superior */
+		PathKeysComparison keyscmp;
+
+		p1_next = lnext(p1);
+
+		/* Compare pathkeys. */
+		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
+
+		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
+			{
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
+					accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost
+					 * STD_FUZZ_FACTOR)
+			{
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
+					remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
+			}
+		}
+
+		/*
+		 * Remove current element from partial_pathlist if dominated by new.
+		 */
+		if (remove_old)
+		{
+			parent_rel->partial_pathlist =
+				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
+			/* add_path has a special case for IndexPath; we don't need it */
+			Assert(!IsA(old_path, IndexPath));
+			pfree(old_path);
+			/* p1_prev does not advance */
+		}
+		else
+		{
+			/* new belongs after this old path if it has cost >= old's */
+			if (new_path->total_cost >= old_path->total_cost)
+				insert_after = p1;
+			/* p1_prev advances */
+			p1_prev = p1;
+		}
+
+		/*
+		 * If we found an old path that dominates new_path, we can quit
+		 * scanning the partial_pathlist; we will not add new_path, and we
+		 * assume new_path cannot dominate any later path.
+		 */
+		if (!accept_new)
+			break;
+	}
+
+	if (accept_new)
+	{
+		/* Accept the new path: insert it at proper place */
+		if (insert_after)
+			lappend_cell(parent_rel->partial_pathlist, insert_after, new_path);
+		else
+			parent_rel->partial_pathlist =
+				lcons(new_path, parent_rel->partial_pathlist);
+	}
+	else
+	{
+		/* add_path has a special case for IndexPath; we don't need it */
+		Assert(!IsA(new_path, IndexPath));
+		/* Reject and recycle the new path */
+		pfree(new_path);
+	}
+}
+
+/*
+ * add_partial_path_precheck
+ *	  Check whether a proposed new partial path could possibly get accepted.
+ *
+ * Unlike add_path_precheck, we can ignore startup cost and parameterization,
+ * since they don't matter for partial paths (see add_partial_path).  But
+ * we do want to make sure we don't add a partial path if there's already
+ * a complete path that dominates it, since in that case the proposed path
+ * is surely a loser.
+ */
+bool
+add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost,
+						  List *pathkeys)
+{
+	ListCell   *p1;
+
+	/*
+	 * Our goal here is twofold.  First, we want to find out whether this path
+	 * is clearly inferior to some existing partial path.  If so, we want to
+	 * reject it immediately.  Second, we want to find out whether this path
+	 * is clearly superior to some existing partial path -- at least, modulo
+	 * final cost computations.  If so, we definitely want to consider it.
+	 *
+	 * Unlike add_path(), we always compare pathkeys here.  This is because we
+	 * expect partial_pathlist to be very short, and getting a definitive
+	 * answer at this stage avoids the need to call add_path_precheck.
+	 */
+	foreach(p1, parent_rel->partial_pathlist)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		PathKeysComparison keyscmp;
+
+		keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER1)
+				return false;
+			if (old_path->total_cost > total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER2)
+				return true;
+		}
+	}
+
+	/*
+	 * This path is neither clearly inferior to an existing partial path nor
+	 * clearly good enough that it might replace one.  Compare it to
+	 * non-parallel plans.  If it loses even before accounting for the cost of
+	 * the Gather node, we should definitely reject it.
+	 *
+	 * Note that we pass the total_cost to add_path_precheck twice.  This is
+	 * because it's never advantageous to consider the startup cost of a
+	 * partial path; the resulting plans, if run in parallel, will be run to
+	 * completion.
+	 */
+	if (!add_path_precheck(parent_rel, total_cost, total_cost, pathkeys,
+						   NULL))
+		return false;
+
+	return true;
+}
+
 
 /*****************************************************************************
  *		PATH NODE CREATION ROUTINES
@@ -697,7 +923,7 @@ add_path_precheck(RelOptInfo *parent_rel,
  */
 Path *
 create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers)
+					Relids required_outer, int parallel_degree)
 {
 	Path	   *pathnode = makeNode(Path);
 
@@ -705,10 +931,12 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->parent = rel;
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
-	pathnode->parallel_aware = nworkers > 0 ? true : false;
+	pathnode->parallel_aware = parallel_degree > 0 ? true : false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = parallel_degree;
 	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
 
-	cost_seqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
 
 	return pathnode;
 }
@@ -727,6 +955,8 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* samplescan has unordered result */
 
 	cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -781,6 +1011,8 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
@@ -827,6 +1059,8 @@ create_bitmap_heap_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = bitmapqual->parallel_safe;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapqual = bitmapqual;
@@ -852,7 +1086,17 @@ create_bitmap_and_path(PlannerInfo *root,
 	pathnode->path.pathtype = T_BitmapAnd;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
+
+	/*
+	 * Currently, a BitmapHeapPath, BitmapAndPath, or BitmapOrPath will be
+	 * parallel-safe if and only if rel->consider_parallel is set.  So, we can
+	 * set the flag for this path based only on the relation-level flag,
+	 * without actually iterating over the list of children.
+	 */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
+
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -877,7 +1121,17 @@ create_bitmap_or_path(PlannerInfo *root,
 	pathnode->path.pathtype = T_BitmapOr;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
+
+	/*
+	 * Currently, a BitmapHeapPath, BitmapAndPath, or BitmapOrPath will be
+	 * parallel-safe if and only if rel->consider_parallel is set.  So, we can
+	 * set the flag for this path based only on the relation-level flag,
+	 * without actually iterating over the list of children.
+	 */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
+
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -903,6 +1157,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->tidquals = tidquals;
@@ -921,7 +1177,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
  * Note that we must handle subpaths = NIL, representing a dummy access path.
  */
 AppendPath *
-create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
+create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer,
+				   int parallel_degree)
 {
 	AppendPath *pathnode = makeNode(AppendPath);
 	ListCell   *l;
@@ -931,6 +1188,8 @@ create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* result is always considered
 										 * unsorted */
 	pathnode->subpaths = subpaths;
@@ -955,6 +1214,8 @@ create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
 		if (l == list_head(subpaths))	/* first node? */
 			pathnode->path.startup_cost = subpath->startup_cost;
 		pathnode->path.total_cost += subpath->total_cost;
+		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
+			subpath->parallel_safe;
 
 		/* All child paths must have same parameterization */
 		Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
@@ -985,6 +1246,8 @@ create_merge_append_path(PlannerInfo *root,
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->subpaths = subpaths;
 
@@ -1008,6 +1271,8 @@ create_merge_append_path(PlannerInfo *root,
 		Path	   *subpath = (Path *) lfirst(l);
 
 		pathnode->path.rows += subpath->rows;
+		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
+			subpath->parallel_safe;
 
 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
 		{
@@ -1052,7 +1317,7 @@ create_merge_append_path(PlannerInfo *root,
  *	  This is only used for the case of a query with an empty jointree.
  */
 ResultPath *
-create_result_path(List *quals)
+create_result_path(RelOptInfo *rel, List *quals)
 {
 	ResultPath *pathnode = makeNode(ResultPath);
 
@@ -1060,6 +1325,8 @@ create_result_path(List *quals)
 	pathnode->path.parent = NULL;
 	pathnode->path.param_info = NULL;	/* there are no other rels... */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;
 	pathnode->quals = quals;
 
@@ -1094,6 +1361,8 @@ create_material_path(RelOptInfo *rel, Path *subpath)
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = subpath->parallel_safe;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = subpath->pathkeys;
 
 	pathnode->subpath = subpath;
@@ -1155,6 +1424,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = subpath->parallel_safe;
+	pathnode->path.parallel_degree = 0;
 
 	/*
 	 * Assume the output is unsorted, since we don't necessarily have pathkeys
@@ -1328,19 +1599,30 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
  */
 GatherPath *
 create_gather_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
-				   Relids required_outer, int nworkers)
+				   Relids required_outer)
 {
 	GatherPath *pathnode = makeNode(GatherPath);
 
+	Assert(subpath->parallel_safe);
+
 	pathnode->path.pathtype = T_Gather;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = false;
+	pathnode->path.parallel_degree = subpath->parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* Gather has unordered result */
 
 	pathnode->subpath = subpath;
-	pathnode->num_workers = nworkers;
+	pathnode->single_copy = false;
+
+	if (pathnode->path.parallel_degree == 0)
+	{
+		pathnode->path.parallel_degree = 1;
+		pathnode->path.pathkeys = subpath->pathkeys;
+		pathnode->single_copy = true;
+	}
 
 	cost_gather(pathnode, root, rel, pathnode->path.param_info);
 
@@ -1393,6 +1675,8 @@ create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_subqueryscan(pathnode, root, rel, pathnode->param_info);
@@ -1416,6 +1700,8 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1439,6 +1725,8 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -1461,6 +1749,8 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* XXX for now, result is always unordered */
 
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -1484,6 +1774,8 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	/* Cost is the same as for a regular CTE scan */
@@ -1517,6 +1809,8 @@ create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.rows = rows;
 	pathnode->path.startup_cost = startup_cost;
 	pathnode->path.total_cost = total_cost;
@@ -1653,6 +1947,10 @@ create_nestloop_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = joinrel->consider_parallel &&
+		outer_path->parallel_safe && inner_path->parallel_safe;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->path.parallel_degree = outer_path->parallel_degree;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->jointype = jointype;
 	pathnode->outerjoinpath = outer_path;
@@ -1711,6 +2009,9 @@ create_mergejoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel &&
+		outer_path->parallel_safe && inner_path->parallel_safe;
+	pathnode->jpath.path.parallel_degree = 0;
 	pathnode->jpath.path.pathkeys = pathkeys;
 	pathnode->jpath.jointype = jointype;
 	pathnode->jpath.outerjoinpath = outer_path;
@@ -1768,6 +2069,10 @@ create_hashjoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel &&
+		outer_path->parallel_safe && inner_path->parallel_safe;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->jpath.path.parallel_degree = outer_path->parallel_degree;
 
 	/*
 	 * A hashjoin never has pathkeys, since its output ordering is
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index f2bdfcc..29ce75f 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -107,6 +107,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	rel->reltargetlist = NIL;
 	rel->pathlist = NIL;
 	rel->ppilist = NIL;
+	rel->partial_pathlist = NIL;
 	rel->cheapest_startup_path = NULL;
 	rel->cheapest_total_path = NULL;
 	rel->cheapest_unique_path = NULL;
@@ -370,6 +371,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->reltargetlist = NIL;
 	joinrel->pathlist = NIL;
 	joinrel->ppilist = NIL;
+	joinrel->partial_pathlist = NIL;
 	joinrel->cheapest_startup_path = NULL;
 	joinrel->cheapest_total_path = NULL;
 	joinrel->cheapest_unique_path = NULL;
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 5393005..f9f13b4 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -458,6 +458,7 @@ typedef struct RelOptInfo
 	List	   *reltargetlist;	/* Vars to be output by scan of relation */
 	List	   *pathlist;		/* Path structures */
 	List	   *ppilist;		/* ParamPathInfos used in pathlist */
+	List	   *partial_pathlist;	/* partial Paths */
 	struct Path *cheapest_startup_path;
 	struct Path *cheapest_total_path;
 	struct Path *cheapest_unique_path;
@@ -759,6 +760,8 @@ typedef struct Path
 	RelOptInfo *parent;			/* the relation this path can build */
 	ParamPathInfo *param_info;	/* parameterization info, or NULL if none */
 	bool		parallel_aware; /* engage parallel-aware logic? */
+	bool		parallel_safe;	/* OK to use as part of parallel plan? */
+	int			parallel_degree; /* desired parallel degree; 0 = not parallel */
 
 	/* estimated size/costs for path (see costsize.c for more info) */
 	double		rows;			/* estimated number of result tuples */
@@ -1062,7 +1065,6 @@ typedef struct GatherPath
 {
 	Path		path;
 	Path	   *subpath;		/* path for each worker */
-	int			num_workers;	/* number of workers sought to help */
 	bool		single_copy;	/* path must not be executed >1x */
 } GatherPath;
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac21a3a..25a7303 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -72,7 +72,7 @@ extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
-			 ParamPathInfo *param_info, int nworkers);
+			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 8fb9eda..4b24da0 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -29,9 +29,12 @@ extern void add_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_path_precheck(RelOptInfo *parent_rel,
 				  Cost startup_cost, Cost total_cost,
 				  List *pathkeys, Relids required_outer);
+extern void add_partial_path(RelOptInfo *parent_rel, Path *new_path);
+extern bool add_partial_path_precheck(RelOptInfo *parent_rel,
+						  Cost total_cost, List *pathkeys);
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers);
+					Relids required_outer, int parallel_degree);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
@@ -59,19 +62,18 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths,
-				   Relids required_outer);
+				   Relids required_outer, int parallel_degree);
 extern MergeAppendPath *create_merge_append_path(PlannerInfo *root,
 						 RelOptInfo *rel,
 						 List *subpaths,
 						 List *pathkeys,
 						 Relids required_outer);
-extern ResultPath *create_result_path(List *quals);
+extern ResultPath *create_result_path(RelOptInfo *rel, List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 				   Path *subpath, SpecialJoinInfo *sjinfo);
 extern GatherPath *create_gather_path(PlannerInfo *root,
-				   RelOptInfo *rel, Path *subpath, Relids required_outer,
-				   int nworkers);
+				   RelOptInfo *rel, Path *subpath, Relids required_outer);
 extern Path *create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 						 List *pathkeys, Relids required_outer);
 extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7757741..4b01850 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -50,6 +50,8 @@ extern RelOptInfo *make_one_rel(PlannerInfo *root, List *joinlist);
 extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
+extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
-- 
2.5.4 (Apple Git-61)

Attachment: parallel-reader-order-v1.patch (text/x-patch)
From 0e006c14df600419b56efc9ce8d2eaffb1891a7a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 18 Dec 2015 10:02:35 -0500
Subject: [PATCH 2/3] Don't advance to the next reader until we can't read
 anything from the current one.

---
 src/backend/executor/nodeGather.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index f32da1e..db5883d 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -359,14 +359,20 @@ gather_readnext(GatherState *gatherstate)
 			continue;
 		}
 
-		/* Advance nextreader pointer in round-robin fashion. */
-		gatherstate->nextreader =
-			(gatherstate->nextreader + 1) % gatherstate->nreaders;
-
 		/* If we got a tuple, return it. */
 		if (tup)
 			return tup;
 
+		/*
+		 * Advance nextreader pointer in round-robin fashion.  Note that we
+		 * only reach this code if we weren't able to get a tuple from the
+		 * current worker.  We used to advance the nextreader pointer after
+		 * every tuple, but it turns out to be much more efficient to keep
+		 * reading from the same queue until that would require blocking.
+		 */
+		gatherstate->nextreader =
+			(gatherstate->nextreader + 1) % gatherstate->nreaders;
+
 		/* Have we visited every TupleQueueReader? */
 		if (gatherstate->nextreader == waitpos)
 		{
-- 
2.5.4 (Apple Git-61)

Attachment: gather-disuse-physical-tlist.patch (text/x-patch)
From 5deb0e7bc1d289616d6fc9b94b83377a47d221fe Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 18 Dec 2015 10:06:21 -0500
Subject: [PATCH 3/3] Make Gather disuse_physical_tlist.

---
 src/backend/optimizer/plan/createplan.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 94e5f26..0c82529 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -558,7 +558,8 @@ use_physical_tlist(PlannerInfo *root, RelOptInfo *rel)
  * If the plan node immediately above a scan would prefer to get only
  * needed Vars and not a physical tlist, it must call this routine to
  * undo the decision made by use_physical_tlist().  Currently, Hash, Sort,
- * and Material nodes want this, so they don't have to store useless columns.
+ * Material, and Gather nodes want this, so they don't have to store or
+ * transfer useless columns.
  */
 static void
 disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
@@ -1123,6 +1124,8 @@ create_gather_plan(PlannerInfo *root, GatherPath *best_path)
 
 	subplan = create_plan_recurse(root, best_path->subpath);
 
+	disuse_physical_tlist(root, subplan, best_path->subpath);
+
 	gather_plan = make_gather(subplan->targetlist,
 							  NIL,
 							  best_path->path.parallel_degree,
-- 
2.5.4 (Apple Git-61)

#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#22)
Re: parallel joins, and better parallel explain

On Fri, Dec 18, 2015 at 8:47 PM, Robert wrote:

Yes, you are right, that create_gather_path() sets parallel_safe to
false unconditionally, but whenever we are building a non-partial path
we should carry forward the parallel_safe state to its parent, and it
seems like that part is missing here.

Ah, right. Woops. I can't exactly replicate your results, but I've
attempted to fix this in a systematic way in the new version attached
here (parallel-join-v3.patch).

I have tested with the latest patch; the problem is solved.

During my testing I observed one more behaviour in the hash join: the
parallel hash join is taking more time than the normal hash join.

Here I have ensured that, apart from the hash column, there is one more
condition on another column, which forces random page fetches.

I think this behaviour is similar to what Amit reported in the thread above:
/messages/by-id/CAA4eK1+s3uD2G1WSkEAW_FZgp8jeYw3YcnvtuepLihE_e1D7Dw@mail.gmail.com

create table t1 (c1 int, c2 int, c3 text);

create table t2 (c1 int, c2 int, c3 text);

insert into t1 values(generate_series(1,10000000),
generate_series(1,10000000), repeat('x', 1000));

insert into t2 values(generate_series(1,3000000),
generate_series(1,3000000), repeat('x', 5));
analyze t1;
analyze t2;
set max_parallel_degree=6;
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
AND t2.c2 + t1.c1 > 100;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=474378.39..474378.40 rows=1 width=0) (actual
time=34507.573..34507.573 rows=1 loops=1)
-> Gather (cost=96436.00..471878.39 rows=1000000 width=0) (actual
time=2004.186..33918.216 rows=2999950 loops=1)
Number of Workers: 6
-> Hash Join (cost=95436.00..370878.39 rows=1000000 width=0)
(actual time=2077.085..18651.868 rows=428564 loops=7)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 7
-> Parallel Seq Scan on t1 (cost=0.00..235164.93
rows=1538462 width=4) (actual time=0.741..13199.231 rows=1428571 loops=7)
-> Hash (cost=46217.00..46217.00 rows=3000000 width=8)
(actual time=2070.827..2070.827 rows=3000000 loops=7)
Buckets: 131072 Batches: 64 Memory Usage: 2861kB
-> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000
width=8) (actual time=0.027..904.607 rows=3000000 loops=7)
Planning time: 0.292 ms
Execution time: 34507.857 ms
(13 rows)

postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
AND t2.c2 + t1.c1 > 100;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1823853.06..1823853.07 rows=1 width=0) (actual
time=17833.067..17833.067 rows=1 loops=1)
-> Hash Join (cost=95436.00..1821353.06 rows=1000000 width=0) (actual
time=1286.788..17558.987 rows=2999950 loops=1)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 50
-> Seq Scan on t1 (cost=0.00..1528572.04 rows=10000004 width=4)
(actual time=2.728..9881.659 rows=10000000 loops=1)
-> Hash (cost=46217.00..46217.00 rows=3000000 width=8) (actual
time=1279.688..1279.688 rows=3000000 loops=1)
Buckets: 131072 Batches: 64 Memory Usage: 2861kB
-> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000
width=8) (actual time=0.029..588.887 rows=3000000 loops=1)
Planning time: 0.314 ms
Execution time: 17833.143 ms
(11 rows)

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Fri, Dec 18, 2015 at 8:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 18, 2015 at 3:54 AM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Fri, Dec 18, 2015 at 7:59 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

Uh oh. That's not supposed to happen. A GatherPath is supposed to
have parallel_safe = false, which should prevent the planner from
using it to form new partial paths. Is this with the latest version
of the patch? The plan output suggests that we're somehow reaching
try_partial_hashjoin_path() with inner_path being a GatherPath, but I
don't immediately see how that's possible, because
create_gather_path() sets parallel_safe to false unconditionally, and
hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
that path is parallel_safe.

Yes, you are right, that create_gather_path() sets parallel_safe to
false unconditionally, but whenever we are building a non-partial path
we should carry forward the parallel_safe state to its parent, and it
seems like that part is missing here.

Ah, right. Woops. I can't exactly replicate your results, but I've
attempted to fix this in a systematic way in the new version attached
here (parallel-join-v3.patch).

Do you have a self-contained test case that reproduces this, or any
insight as to how it's happening here?

This is the TPC-H benchmark case; we can set it up like this:
1. git clone https://tkejser@bitbucket.org/tkejser/tpch-dbgen.git
2. compile using make
3. ./dbgen -v -s 5
4. ./qgen

Thanks. After a bit of fiddling I was able to get this to work. I'm
attaching two other patches that seem to help this case quite
considerably. The first (parallel-reader-order-v1) causes Gather to
read from the same worker repeatedly until it can't get another tuple
from that worker without blocking, and only then move on to the next
worker. With 4 workers, this seems to be drastically more efficient
than what's currently in master - I saw the time for Q5 drop from over
17 seconds to about 6 (this was an assert-enabled build running with
EXPLAIN ANALYZE, though, so take those numbers with a grain of salt).
The second (gather-disuse-physical-tlist.patch) causes Gather to force
underlying scan nodes to project, which is a good idea here for
reasons very similar to why it's a good idea for the existing node
types that use disuse_physical_tlist: forcing extra data through the
Gather node is bad. That shaved another half second off this query.
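
To illustrate (an example of mine, not from the patch): with the wide
t1 table Dilip created earlier in this thread, a query like

    SELECT sum(c1) FROM t1;

would, under a physical tlist, push the 1000-byte c3 column through
every worker's tuple queue even though nothing above the Gather needs
it; making the scan project only c1 keeps the queues narrow.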

The exact query I was using for testing was:

explain (analyze, verbose) select n_name, sum(l_extendedprice * (1 -
l_discount)) as revenue from customer, orders, lineitem, supplier,
nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey
and l_suppkey = s_suppkey and c_nationkey = s_nationkey and
s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name =
'EUROPE' and o_orderdate >= date '1995-01-01' and o_orderdate < date
'1995-01-01' + interval '1' year group by n_name order by revenue
desc;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#23)
Re: parallel joins, and better parallel explain

On Tue, Dec 22, 2015 at 4:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 18, 2015 at 8:47 PM, Robert wrote:

Yes, you are right, that create_gather_path() sets parallel_safe to
false unconditionally, but whenever we are building a non-partial path
we should carry forward the parallel_safe state to its parent, and it
seems like that part is missing here.

Ah, right. Woops. I can't exactly replicate your results, but I've
attempted to fix this in a systematic way in the new version attached
here (parallel-join-v3.patch).

I have tested with the latest patch; the problem is solved.

During my testing I observed one more behaviour in the hash join: the
parallel hash join is taking more time than the normal hash join.

I think the gather-reader-order patch will fix this. Here's a test
with all three patches.

rhaas=# SET max_parallel_degree=0;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.192 ms
count
---------
2999950
(1 row)

Time: 11331.425 ms
rhaas=# SET max_parallel_degree=1;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.185 ms
count
---------
2999950
(1 row)

Time: 8796.190 ms
rhaas=# SET max_parallel_degree=2;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.192 ms
count
---------
2999950
(1 row)

Time: 8153.258 ms
rhaas=# SET max_parallel_degree=3;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.187 ms
count
---------
2999950
(1 row)

Time: 6198.163 ms
rhaas=# SET max_parallel_degree=4;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.190 ms
count
---------
2999950
(1 row)

Time: 7511.970 ms
rhaas=# SET max_parallel_degree=5;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.152 ms
count
---------
2999950
(1 row)

Time: 7651.862 ms
rhaas=# SET max_parallel_degree=6;SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.195 ms
count
---------
2999950
(1 row)

Time: 7584.073 ms

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#24)
Re: parallel joins, and better parallel explain

On Tue, Dec 22, 2015 at 8:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 22, 2015 at 4:14 AM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Fri, Dec 18, 2015 at 8:47 PM, Robert wrote:

Yes, you are right, that create_gather_path() sets parallel_safe to
false unconditionally, but whenever we are building a non-partial path
we should carry forward the parallel_safe state to its parent, and it
seems like that part is missing here.

Ah, right. Woops. I can't exactly replicate your results, but I've
attempted to fix this in a systematic way in the new version attached
here (parallel-join-v3.patch).

I have tested with the latest patch; the problem is solved.

During my testing I observed one more behaviour in the hash join: the
parallel hash join is taking more time than the normal hash join.

I think the gather-reader-order patch will fix this. Here's a test
with all three patches.

Yeah, that's right. After applying all three patches the problem is
fixed; now the parallel hash join is faster than the normal hash join.

I have tested one more case, which Amit mentioned, and in that case the
parallel plan (parallel degree >= 3) is still slow. In the normal case
the planner selects a Hash Join, but at parallel degree >= 3 it selects
a parallel Nested Loop Join, which makes the query costlier.

CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 3000000) g;
Analyze t1;
Analyze t2;

postgres=# set max_parallel_degree=0;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 =
t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=223208.93..223208.94 rows=1 width=0) (actual
time=2148.840..2148.841 rows=1 loops=1)
-> Hash Join (cost=204052.91..223208.92 rows=1 width=0) (actual
time=1925.309..2148.812 rows=101 loops=1)
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000 width=4)
(actual time=0.025..104.028 rows=1000000 loops=1)
-> Hash (cost=204052.90..204052.90 rows=1 width=4) (actual
time=1925.219..1925.219 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on t1 (cost=0.00..204052.90 rows=1 width=4)
(actual time=0.029..1925.196 rows=101 loops=1)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 9999899
Planning time: 0.470 ms
Execution time: 2148.928 ms
(11 rows)

postgres=# set max_parallel_degree=3;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 =
t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=78278.36..78278.37 rows=1 width=0) (actual
time=19944.113..19944.113 rows=1 loops=1)
-> Gather (cost=1000.00..78278.36 rows=1 width=0) (actual
time=0.682..19943.928 rows=101 loops=1)
Number of Workers: 3
-> Nested Loop (cost=0.00..77278.26 rows=1 width=0) (actual
time=690.633..6556.201 rows=25 loops=4)
Join Filter: (t1.c1 = t2.c1)
Rows Removed by Join Filter: 25249975
-> Parallel Seq Scan on t1 (cost=0.00..58300.83 rows=0
width=4) (actual time=619.198..619.262 rows=25 loops=4)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 2499975
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000
width=4) (actual time=0.008..105.757 rows=1000000 loops=101)
Planning time: 0.206 ms
Execution time: 19944.748 ms

postgres=# set max_parallel_degree=1;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 =
t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=156191.39..156191.40 rows=1 width=0) (actual
time=1336.401..1336.401 rows=1 loops=1)
-> Hash Join (cost=137035.38..156191.39 rows=1 width=0) (actual
time=1110.562..1336.386 rows=101 loops=1)
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000 width=4)
(actual time=0.025..101.659 rows=1000000 loops=1)
-> Hash (cost=137035.37..137035.37 rows=1 width=4) (actual
time=1110.486..1110.486 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Gather (cost=1000.00..137035.37 rows=1 width=4) (actual
time=0.493..1110.445 rows=101 loops=1)
Number of Workers: 1
-> Parallel Seq Scan on t1 (cost=0.00..136035.27
rows=1 width=4) (actual time=553.212..1107.992 rows=50 loops=2)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 4999950
Planning time: 0.211 ms
Execution time: 1336.618 ms
(13 rows)

postgres=# set max_parallel_degree=2;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 =
t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=101777.29..101777.29 rows=1 width=0) (actual
time=1014.506..1014.507 rows=1 loops=1)
-> Hash Join (cost=82621.27..101777.28 rows=1 width=0) (actual
time=796.628..1014.493 rows=101 loops=1)
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000 width=4)
(actual time=0.023..99.313 rows=1000000 loops=1)
-> Hash (cost=82621.26..82621.26 rows=1 width=4) (actual
time=796.552..796.552 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Gather (cost=1000.00..82621.26 rows=1 width=4) (actual
time=0.435..796.499 rows=101 loops=1)
Number of Workers: 2
-> Parallel Seq Scan on t1 (cost=0.00..81621.16
rows=0 width=4) (actual time=528.052..793.243 rows=34 loops=3)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 3333300
Planning time: 0.200 ms
Execution time: 1014.672 ms
(13 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#26Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#25)
Re: parallel joins, and better parallel explain

On Wed, Dec 23, 2015 at 2:34 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think the gather-reader-order patch will fix this. Here's a test
with all three patches.

Yeah, that's right. After applying all three patches the problem is
fixed; now the parallel hash join is faster than the normal hash join.

Thanks. I've committed the two smaller patches; it seems fairly clear
that those are good changes independent of the parallel join stuff.

I have tested one more case, which Amit mentioned, and in that case the
parallel plan (parallel degree >= 3) is still slow. In the normal case
the planner selects a Hash Join, but at parallel degree >= 3 it selects
a parallel Nested Loop Join, which makes the query costlier.

Hmm, I'm not sure why that is happening. I'll poke at it a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#27Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#25)
1 attachment(s)
Re: parallel joins, and better parallel explain

On Wed, Dec 23, 2015 at 2:34 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah, that's right. After applying all three patches the problem is
fixed; now the parallel hash join is faster than the normal hash join.

I have tested one more case, which Amit mentioned, and in that case the
parallel plan (parallel degree >= 3) is still slow. In the normal case
the planner selects a Hash Join, but at parallel degree >= 3 it selects
a parallel Nested Loop Join, which makes the query costlier.

While investigating this problem, I discovered that I can produce a
regression even on unpatched master:

rhaas=# set max_parallel_degree = 0;
SET
rhaas=# explain select sum(1) from t1;
QUERY PLAN
---------------------------------------------------------------------
Aggregate (cost=1553572.00..1553572.01 rows=1 width=0)
-> Seq Scan on t1 (cost=0.00..1528572.00 rows=10000000 width=0)
(2 rows)

rhaas=# set max_parallel_degree = 3;
SET
rhaas=# explain select sum(1) from t1;
QUERY PLAN
-----------------------------------------------------------------------------------
Aggregate (cost=1462734.86..1462734.87 rows=1 width=0)
-> Gather (cost=1000.00..1437734.86 rows=10000000 width=0)
Number of Workers: 3
-> Parallel Seq Scan on t1 (cost=0.00..436734.86
rows=10000000 width=0)
(4 rows)

The problem here is that the planner imagines that the sequential scan
is going to parallelize perfectly, which is not the case. A Gather
node is ten times as expensive per tuple as a sequential scan, but the
old costing divided even the scan's per-page cost among the workers, so
if you crank
the number of workers up high enough, the cost per tuple appears to
drop until it eventually gets low enough that paying the cost of a
Gather node looks worthwhile. I tweaked cost_seqscan() so that it
spreads out the CPU cost among all of the workers but assumes the disk
cost has to be paid regardless, and that fixes this problem.
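
To spell out the arithmetic (my own back-of-the-envelope, using the
run_cost / (nworkers + 0.5) formula that this patch removes, and the
default cpu_tuple_cost of 0.01 with Gather charging ten times that per
tuple; the numbers match the plans above):

    serial Seq Scan on t1:                       1,528,572
    old parallel model:  1,528,572 / (3 + 0.5) =   436,735
    Gather: 1,000 startup + 10,000,000 * 0.1   = 1,001,000
    parallel total:      436,735 + 1,001,000   = 1,437,735

Since 1,437,735 is less than 1,528,572, the parallel plan looks cheaper
on paper even though the per-page I/O cost can't really be shared out
among the workers like that.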

It doesn't fix your example, though. Even with the costing changes
mentioned above, the planner still thinks a nested loop over two
seqscans has something to recommend it:

rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=161755.97..161755.98 rows=1 width=0) (actual
time=41164.506..41164.507 rows=1 loops=1)
Output: count(*)
-> Gather (cost=1000.00..161755.97 rows=1 width=0) (actual
time=0.436..41164.388 rows=101 loops=1)
Number of Workers: 3
-> Nested Loop (cost=0.00..160755.87 rows=1 width=0)
(actual time=329.227..12123.414 rows=25 loops=4)
Join Filter: (t1.c1 = t2.c1)
Rows Removed by Join Filter: 75749975
Worker 0: actual time=439.924..439.924 rows=0 loops=1
Worker 1: actual time=440.776..440.776 rows=0 loops=1
Worker 2: actual time=436.100..6449.041 rows=15 loops=1
-> Parallel Seq Scan on public.t1
(cost=0.00..102442.10 rows=0 width=4) (actual time=220.185..220.228
rows=25 loops=4)
Output: t1.c1, t1.c2
Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
Rows Removed by Filter: 2499975
Worker 0: actual time=439.922..439.922 rows=0 loops=1
Worker 1: actual time=440.773..440.773 rows=0 loops=1
Worker 2: actual time=0.016..0.055 rows=15 loops=1
-> Seq Scan on public.t2 (cost=0.00..46217.00
rows=3000000 width=4) (actual time=0.007..235.143 rows=3000000
loops=101)
Output: t2.c1, t2.c2
Worker 2: actual time=0.012..215.711 rows=3000000 loops=15
Planning time: 0.150 ms
Execution time: 41164.597 ms

But this is not entirely the fault of the parallel query code. If you
force a seqscan-over-seqscan plan in the non-parallel case, it
estimates the join cost as 287772.00, only slightly more than the
261522.02 cost units it thinks a non-parallel hash join will cost. In
fact, however, the non-parallel hash join runs in 1.2 seconds and the
non-parallel nested loop takes 46 seconds. So the first problem here
is that a plan that the query planner thinks is only 10% more
expensive actually runs for almost 40 times longer. If the planner
had accurately estimated the real cost of the nested loop, this plan
wouldn't have been chosen. If you set enable_nestloop=false, then you
get this plan:

rhaas=# set enable_nestloop=false;
SET
rhaas=# set max_parallel_degree=3;
SET
rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=160909.22..160909.23 rows=1 width=0) (actual
time=647.010..647.011 rows=1 loops=1)
Output: count(*)
-> Hash Join (cost=103442.21..160909.22 rows=1 width=0) (actual
time=234.397..646.985 rows=101 loops=1)
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on public.t2 (cost=0.00..46217.00 rows=3000000
width=4) (actual time=0.033..197.595 rows=3000000 loops=1)
Output: t2.c1, t2.c2
-> Hash (cost=103442.20..103442.20 rows=1 width=4) (actual
time=234.235..234.235 rows=101 loops=1)
Output: t1.c1
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Gather (cost=1000.00..103442.20 rows=1 width=4)
(actual time=0.289..234.199 rows=101 loops=1)
Output: t1.c1
Number of Workers: 3
-> Parallel Seq Scan on public.t1
(cost=0.00..102442.10 rows=0 width=4) (actual time=171.667..230.080
rows=25 loops=4)
Output: t1.c1
Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
Rows Removed by Filter: 2499975
Worker 0: actual time=228.628..228.628 rows=0 loops=1
Worker 1: actual time=228.432..228.432 rows=0 loops=1
Worker 2: actual time=229.566..229.566 rows=0 loops=1
Planning time: 0.160 ms
Execution time: 647.133 ms
(21 rows)

And that's a good plan.

The parallel nested loop case also suffers from the fact that workers
0 and 1 don't happen to find any of the interesting rows in t1 at all,
and worker 2 only finds 15 of them. The leader finds the other 86 and
thus has to run most of the iterations of the scan on t2 itself. If
the work were divided equally, the parallel nested loop would probably
run significantly faster, although it would still be ten times slower
than the non-parallel hash join. In the long term, I think the way to
fix the uneven work distribution that happens here is to construct the
hash table in parallel, as already discussed with Simon upthread.
Then we could have a Gather node on top of a Hash Join, both of whose
inputs are Parallel Seq Scans, and then there's basically no risk of a
skewed work distribution.
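
The plan shape that points toward would be roughly this (a hypothetical
sketch of mine, not something this patch can generate, since it
requires the shared hash build):

    Gather
      -> Hash Join
           -> Parallel Seq Scan on t1
           -> Hash  (built cooperatively by all participants)
               -> Parallel Seq Scan on t2

With every participant scanning only a slice of each input, no single
process can end up doing most of the probe work.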

While that would be nice to have, I think the big thing to focus on
here is how inaccurate the nested loop costing is - as already
mentioned, it thinks the non-parallel nested loop is 10% slower than
the hash join when it's really forty times slower. The main reason
for that is that ((t1.c1 >= 100) AND (t1.c1 <= 200)) actually matches
101 rows, but the planner expects it to match just one.  In a real
table, there would probably be a unique index on t1 (c1), and that
also fixes the problem. If I add that, the non-parallel query runs in
422 ms (with EXPLAIN ANALYZE, on a debug build) and the parallel query
runs in 125 ms, and the row count estimates are correct, too. Even if
I disable actually using the index, the fact that it fixes the
cardinality estimates causes the query to choose a good (parallel!)
plan.
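
For anyone reproducing this, the index in question is just (index name
is mine):

    CREATE UNIQUE INDEX t1_c1_uidx ON t1 (c1);
    ANALYZE t1;

With the row count for the BETWEEN clause estimated correctly, the
nested loop no longer looks competitive.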

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-join-v4.patch (application/x-patch)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 30e6b3d..5bc8eef 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -167,25 +167,25 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	e->nnodes++;
 
 	/* Call estimators for parallel-aware nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanEstimate((SeqScanState *) planstate,
-								e->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanEstimate((SeqScanState *) planstate,
+									e->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
 
 /*
- * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
- * may need to initialize shared state in the DSM before parallel workers
- * are available.  They can allocate the space they previous estimated using
- * shm_toc_allocate, and add the keys they previously estimated using
- * shm_toc_insert, in each case targeting pcxt->toc.
+ * Initialize the dynamic shared memory segment that will be used to control
+ * parallel execution.
  */
 static bool
 ExecParallelInitializeDSM(PlanState *planstate,
@@ -202,15 +202,26 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	/*
+	 * Call initializers for parallel-aware plan nodes.
+	 *
+	 * Ordinary plan nodes won't do anything here, but parallel-aware plan
+	 * nodes may need to initialize shared state in the DSM before parallel
+	 * workers are available.  They can allocate the space they previously
+	 * estimated using shm_toc_allocate, and add the keys they previously
+	 * estimated using shm_toc_insert, in each case targeting pcxt->toc.
+	 */
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeDSM((SeqScanState *) planstate,
-									 d->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
+										 d->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
@@ -623,13 +634,16 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 		return false;
 
 	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7169d46..939a662 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1588,6 +1588,8 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_BOOL_FIELD(parallel_safe);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1765,7 +1767,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
 
@@ -1887,6 +1888,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 916a518..5019804 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)
 
 
--- bjm & tgl
+Parallel Query and Partial Paths
+--------------------------------
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  Academic literature on parallel query suggests that
+that parallel execution strategies can be divided into essentially two
+categories: pipelined parallelism, where the execution of the query is
+divided into multiple stages and each stage is handled by a separate
+process; and partitioning parallelism, where the data is split between
+multiple processes and each process handles a subset of it.  The
+literature, however, suggests that gains from pipeline parallelism are
+often very limited due to the difficulty of avoiding pipeline stalls.
+Consequently, we do not currently attempt to generate query plans that
+use this technique.
+
+Instead, we focus on partitioning parallelism, which does not require
+that the underlying table be partitioned.  It only requires that (1)
+there is some method of dividing the data from at least one of the base
+tables involved in the relation across multiple processes, (2) allowing
+each process to handle its own portion of the data, and then (3)
+collecting the results.  Requirements (2) and (3) are satisfied by the
+executor node Gather, which launches any number of worker processes and
+executes its single child plan in all of them (and perhaps in the leader
+also, if the children aren't generating enough data to keep the leader
+busy).  Requirement (1) is handled by the SeqScan node: when invoked
+with parallel_aware = true, this node will, in effect, partition the
+table on a block by block basis, returning a subset of the tuples from
+the relation in each worker where that SeqScan is executed.  A similar
+scheme could be (and probably should be) implemented for bitmap heap
+scans.
+
+Just as we do for non-parallel access methods, we build Paths to
+represent access strategies that can be used in a parallel plan.  These
+are, in essence, the same strategies that are available in the
+non-parallel plan, but there is an important difference: a path that
+will run beneath a Gather node returns only a subset of the query
+results in each worker, not all of them.  To form a path that can
+actually be executed, the (rather large) cost of the Gather node must be
+accounted for.  For this reason among others, paths intended to run
+beneath a Gather node - which we call "partial" paths since they return
+only a subset of the results in each worker - must be kept separate from
+ordinary paths (see RelOptInfo's partial_pathlist and the function
+add_partial_path).
+
+One of the keys to making parallel query effective is to run as much of
+the query in parallel as possible.  Therefore, we expect it to generally
+be desirable to postpone the Gather stage until as near to the top of the
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costing them accurately) is likely to keep us
+busy for a long time to come.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 4516cd3..326de0c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -72,6 +72,7 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 				 Index rti, RangeTblEntry *rte);
 static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
 				   RangeTblEntry *rte);
+static void create_parallel_paths(PlannerInfo *root, RelOptInfo *rel);
 static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
 						  RangeTblEntry *rte);
 static bool function_rte_parallel_ok(RangeTblEntry *rte);
@@ -612,7 +613,6 @@ static void
 set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 {
 	Relids		required_outer;
-	int			parallel_threshold = 1000;
 
 	/*
 	 * We don't support pushing join clauses into the quals of a seqscan, but
@@ -624,39 +624,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer, 0));
 
-	/* Consider parallel sequential scan */
-	if (rel->consider_parallel && rel->pages > parallel_threshold &&
-		required_outer == NULL)
-	{
-		Path *path;
-		int parallel_degree = 1;
-
-		/*
-		 * Limit the degree of parallelism logarithmically based on the size
-		 * of the relation.  This probably needs to be a good deal more
-		 * sophisticated, but we need something here for now.
-		 */
-		while (rel->pages > parallel_threshold * 3 &&
-			   parallel_degree < max_parallel_degree)
-		{
-			parallel_degree++;
-			parallel_threshold *= 3;
-			if (parallel_threshold >= PG_INT32_MAX / 3)
-				break;
-		}
-
-		/*
-		 * Ideally we should consider postponing the gather operation until
-		 * much later, after we've pushed joins and so on atop the parallel
-		 * sequential scan path.  But we don't have the infrastructure for
-		 * that yet, so just do this for now.
-		 */
-		path = create_seqscan_path(root, rel, required_outer, parallel_degree);
-		path = (Path *)
-			create_gather_path(root, rel, path, required_outer,
-							   parallel_degree);
-		add_path(rel, path);
-	}
+	/* If appropriate, consider parallel sequential scan */
+	if (rel->consider_parallel && required_outer == NULL)
+		create_parallel_paths(root, rel);
 
 	/* Consider index scans */
 	create_index_paths(root, rel);
@@ -666,6 +636,54 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * create_parallel_paths
+ *	  Build parallel access paths for a plain relation
+ */
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int		parallel_threshold = 1000;
+	int		parallel_degree = 1;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * without doing anything ... unless it's an inheritance child.  In that case,
+	 * we want to generate a parallel path here anyway.  It might not be worthwhile
+	 * just for this relation, but when combined with all of its inheritance siblings
+	 * it may well pay off.
+	 */
+	if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+		return;
+
+	/*
+	 * Limit the degree of parallelism logarithmically based on the size of the
+	 * relation.  This probably needs to be a good deal more sophisticated, but we
+	 * need something here for now.
+	 */
+	while (rel->pages > parallel_threshold * 3 &&
+		   parallel_degree < max_parallel_degree)
+	{
+		parallel_degree++;
+		parallel_threshold *= 3;
+		if (parallel_threshold >= PG_INT32_MAX / 3)
+			break;
+	}
+
+	/* Add an unordered partial path based on a parallel sequential scan. */
+	add_partial_path(rel, create_seqscan_path(root, rel, NULL, parallel_degree));
+
+	/*
+	 * If this is a baserel, consider gathering any partial paths we may have
+	 * just created.  If we gathered an inheritance child, we could end up
+	 * with a very large number of gather nodes, each trying to grab its own
+	 * pool of workers, so don't do this in that case.  Instead, we'll
+	 * consider gathering partial paths for the appendrel.
+	 */
+	if (rel->reloptkind == RELOPT_BASEREL)
+		generate_gather_paths(root, rel);
+}
+
+/*
  * set_tablesample_rel_size
  *	  Set size estimates for a sampled relation
  */
@@ -1039,6 +1057,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	List	   *live_childrels = NIL;
 	List	   *subpaths = NIL;
 	bool		subpaths_valid = true;
+	List	   *partial_subpaths = NIL;
+	bool		partial_subpaths_valid = true;
 	List	   *all_child_pathkeys = NIL;
 	List	   *all_child_outers = NIL;
 	ListCell   *l;
@@ -1093,6 +1113,13 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		else
 			subpaths_valid = false;
 
+		/* Same idea, but for a partial plan. */
+		if (childrel->partial_pathlist != NIL)
+			partial_subpaths = accumulate_append_subpath(partial_subpaths,
+									   linitial(childrel->partial_pathlist));
+		else
+			partial_subpaths_valid = false;
+
 		/*
 		 * Collect lists of all the available path orderings and
 		 * parameterizations for all the children.  We use these as a
@@ -1164,7 +1191,39 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 * if we have zero or one live subpath due to constraint exclusion.)
 	 */
 	if (subpaths_valid)
-		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL));
+		add_path(rel, (Path *) create_append_path(rel, subpaths, NULL, 0));
+
+	/*
+	 * Consider an append of unordered, unparameterized partial paths.
+	 */
+	if (partial_subpaths_valid)
+	{
+		AppendPath *appendpath;
+		ListCell   *lc;
+		int			parallel_degree = 0;
+
+		/*
+		 * Decide what parallel degree to request for this append path.  For
+		 * now, we just use the maximum parallel degree of any member.  It
+		 * might be useful to use a higher number if the Append node were
+		 * smart enough to spread out the workers, but it currently isn't.
+		 */
+		foreach(lc, partial_subpaths)
+		{
+			Path	   *path = lfirst(lc);
+
+			parallel_degree = Max(parallel_degree, path->parallel_degree);
+		}
+		Assert(parallel_degree > 0);
+
+		/* Generate a partial append path. */
+		appendpath = create_append_path(rel, partial_subpaths, NULL,
+										parallel_degree);
+		add_partial_path(rel, (Path *) appendpath);
+
+		/* Consider gathering it. */
+		generate_gather_paths(root, rel);
+	}
 
 	/*
 	 * Also build unparameterized MergeAppend paths based on the collected
@@ -1214,7 +1273,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 		if (subpaths_valid)
 			add_path(rel, (Path *)
-					 create_append_path(rel, subpaths, required_outer));
+					 create_append_path(rel, subpaths, required_outer, 0));
 	}
 }
 
@@ -1440,8 +1499,9 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 
 	/* Discard any pre-existing paths; no further need for them */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/*
 	 * We set the cheapest path immediately, to ensure that IS_DUMMY_REL()
@@ -1844,6 +1904,36 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * generate_gather_paths
+ *		Generate parallel access paths for a relation by pushing a Gather on
+ *		top of a partial path.
+ */
+void
+generate_gather_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	Path	   *cheapest_partial_path;
+	Path	   *simple_gather_path;
+
+	/* If there are no partial paths, there's nothing to do here. */
+	if (rel->partial_pathlist == NIL)
+		return;
+
+	/*
+	 * The output of Gather is currently always unsorted, so there's only one
+	 * partial path of interest: the cheapest one.
+	 *
+	 * Eventually, we should have a Gather Merge operation that can merge
+	 * multiple tuple streams together while preserving their ordering.  We
+	 * could usefully generate such a path from each partial path that has
+	 * non-NIL pathkeys.
+	 */
+	cheapest_partial_path = linitial(rel->partial_pathlist);
+	simple_gather_path = (Path *)
+		create_gather_path(root, rel, cheapest_partial_path, NULL);
+	add_path(rel, simple_gather_path);
+}
+
+/*
  * make_rel_from_joinlist
  *	  Build access paths using a "joinlist" to guide the join path search.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d7d1e32 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -186,11 +186,11 @@ clamp_row_est(double nrows)
  */
 void
 cost_seqscan(Path *path, PlannerInfo *root,
-			 RelOptInfo *baserel, ParamPathInfo *param_info,
-			 int nworkers)
+			 RelOptInfo *baserel, ParamPathInfo *param_info)
 {
 	Cost		startup_cost = 0;
-	Cost		run_cost = 0;
+	Cost		cpu_run_cost;
+	Cost		disk_run_cost;
 	double		spc_seq_page_cost;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
@@ -216,27 +216,58 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	/*
 	 * disk costs
 	 */
-	run_cost += spc_seq_page_cost * baserel->pages;
+	disk_run_cost = spc_seq_page_cost * baserel->pages;
 
 	/* CPU costs */
 	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
 
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
-	run_cost += cpu_per_tuple * baserel->tuples;
+	cpu_run_cost = cpu_per_tuple * baserel->tuples;
 
-	/*
-	 * Primitive parallel cost model.  Assume the leader will do half as much
-	 * work as a regular worker, because it will also need to read the tuples
-	 * returned by the workers when they percolate up to the gather ndoe.
-	 * This is almost certainly not exactly the right way to model this, so
-	 * this will probably need to be changed at some point...
-	 */
-	if (nworkers > 0)
-		run_cost = run_cost / (nworkers + 0.5);
+	/* Adjust costing for parallelism, if used. */
+	if (path->parallel_degree > 0)
+	{
+		double	parallel_divisor = path->parallel_degree;
+		double	leader_contribution;
+
+		/*
+		 * Early experience with parallel query suggests that when there is
+		 * only one worker, the leader often makes a very substantial
+		 * contribution to executing the parallel portion of the plan, but as
+		 * more workers are added, it does less and less, because it's busy
+		 * reading tuples from the workers and doing whatever non-parallel
+		 * post-processing is needed.  By the time we reach 4 workers, the
+		 * leader no longer makes a meaningful contribution.  Thus, for now,
+		 * estimate that the leader spends 30% of its time servicing each
+		 * worker, and the remainder executing the parallel plan.
+		 */
+		leader_contribution = 1.0 - (0.3 * path->parallel_degree);
+		if (leader_contribution > 0)
+			parallel_divisor += leader_contribution;
+
+		/*
+		 * In the case of a parallel plan, the row count needs to represent
+		 * the number of tuples processed per worker.  Otherwise, higher-level
+		 * plan nodes that appear below the gather will be costed incorrectly,
+		 * because they'll anticipate receiving more rows than any given copy
+		 * will actually get.
+		 */
+		path->rows /= parallel_divisor;
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+
+		/*
+		 * It may be possible to amortize some of the I/O cost, but probably
+		 * not very much, because most operating systems already do aggressive
+		 * prefetching.  For now, we assume that the disk run cost can't be
+		 * amortized at all.
+		 */
+	}
 
 	path->startup_cost = startup_cost;
-	path->total_cost = startup_cost + run_cost;
+	path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 53d8fdd..33268d1 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -34,6 +34,12 @@ static void sort_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 static void match_unsorted_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
+static void consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra);
 static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
@@ -216,7 +222,12 @@ add_paths_to_joinrel(PlannerInfo *root,
 												 jointype, &extra);
 
 	/*
-	 * 6. Finally, give extensions a chance to manipulate the path list.
+	 * 6. Consider gathering partial paths.
+	 */
+	generate_gather_paths(root, joinrel);
+
+	/*
+	 * 7. Finally, give extensions a chance to manipulate the path list.
 	 */
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
@@ -330,6 +341,62 @@ try_nestloop_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_nestloop_path
+ *	  Consider a partial nestloop join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_nestloop_path(PlannerInfo *root,
+				  RelOptInfo *joinrel,
+				  Path *outer_path,
+				  Path *inner_path,
+				  List *pathkeys,
+				  JoinType jointype,
+				  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * If the inner path is parameterized, the parameterization must be fully
+	 * satisfied by the proposed outer path.  Parameterized partial paths are
+	 * not supported.  The caller should already have verified that no
+	 * extra_lateral_rels are required here.
+	 */
+	Assert(bms_is_empty(joinrel->lateral_relids));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_nestloop(root, &workspace, jointype,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_nestloop_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  pathkeys,
+								  NULL));
+}
+
+/*
  * try_mergejoin_path
  *	  Consider a merge join path; if it appears useful, push it into
  *	  the joinrel's pathlist via add_path().
@@ -472,6 +539,62 @@ try_hashjoin_path(PlannerInfo *root,
 }
 
 /*
+ * try_partial_hashjoin_path
+ *	  Consider a partial hash join path; if it appears useful, push it into
+ *	  the joinrel's partial_pathlist via add_partial_path().
+ */
+static void
+try_partial_hashjoin_path(PlannerInfo *root,
+						  RelOptInfo *joinrel,
+						  Path *outer_path,
+						  Path *inner_path,
+						  List *hashclauses,
+						  JoinType jointype,
+						  JoinPathExtraData *extra)
+{
+	JoinCostWorkspace workspace;
+
+	/*
+	 * If the inner path is parameterized, the parameterization must be fully
+	 * satisfied by the proposed outer path.  Parameterized partial paths are
+	 * not supported.  The caller should already have verified that no
+	 * extra_lateral_rels are required here.
+	 */
+	Assert(bms_is_empty(joinrel->lateral_relids));
+	if (inner_path->param_info != NULL)
+	{
+		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+
+		if (!bms_is_empty(inner_paramrels))
+			return;
+	}
+
+	/*
+	 * Before creating a path, get a quick lower bound on what it is likely
+	 * to cost.  Bail out right away if it looks terrible.
+	 */
+	initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
+						  outer_path, inner_path,
+						  extra->sjinfo, &extra->semifactors);
+	if (!add_partial_path_precheck(joinrel, workspace.total_cost, NIL))
+		return;
+
+	/* Might be good enough to be worth trying, so let's try it. */
+	add_partial_path(joinrel, (Path *)
+			 create_hashjoin_path(root,
+								  joinrel,
+								  jointype,
+								  &workspace,
+								  extra->sjinfo,
+								  &extra->semifactors,
+								  outer_path,
+								  inner_path,
+								  extra->restrictlist,
+								  NULL,
+								  hashclauses));
+}
+
+/*
  * clause_sides_match_join
  *	  Determine whether a join clause is of the right form to use in this join.
  *
@@ -1063,6 +1186,85 @@ match_unsorted_outer(PlannerInfo *root,
 				break;
 		}
 	}
+
+	/*
+	 * If the joinrel is parallel-safe and the join type supports nested loops,
+	 * we may be able to consider a partial nestloop plan.  However, we can't
+	 * handle JOIN_UNIQUE_OUTER, because the outer path will be partial, and
+	 * therefore we won't be able to properly guarantee uniqueness.  Nor can
+	 * we handle extra_lateral_rels, since partial paths must not be
+	 * parameterized.
+	 */
+	if (joinrel->consider_parallel && nestjoinOK &&
+		save_jointype != JOIN_UNIQUE_OUTER &&
+		bms_is_empty(joinrel->lateral_relids))
+		consider_parallel_nestloop(root, joinrel, outerrel, innerrel,
+								   save_jointype, extra);
+}
+
+/*
+ * consider_parallel_nestloop
+ *	  Try to build partial paths for a joinrel by joining a partial path for the
+ *	  outer relation to a complete path for the inner relation.
+ *
+ * 'joinrel' is the join relation
+ * 'outerrel' is the outer join relation
+ * 'innerrel' is the inner join relation
+ * 'jointype' is the type of join to do
+ * 'extra' contains additional input values
+ */
+static void
+consider_parallel_nestloop(PlannerInfo *root,
+						   RelOptInfo *joinrel,
+						   RelOptInfo *outerrel,
+						   RelOptInfo *innerrel,
+						   JoinType jointype,
+						   JoinPathExtraData *extra)
+{
+	ListCell   *lc1;
+
+	foreach(lc1, outerrel->partial_pathlist)
+	{
+		Path	   *outerpath = (Path *) lfirst(lc1);
+		List	   *pathkeys;
+		ListCell   *lc2;
+
+		/* Figure out what useful ordering any paths we create will have. */
+		pathkeys = build_join_pathkeys(root, joinrel, jointype,
+									   outerpath->pathkeys);
+
+		/*
+		 * Try the cheapest parameterized paths; only those which will
+		 * produce an unparameterized path when joined to this outerrel
+		 * will survive try_partial_nestloop_path.  The cheapest
+		 * unparameterized path is also in this list.
+		 */
+		foreach(lc2, innerrel->cheapest_parameterized_paths)
+		{
+			Path	   *innerpath = (Path *) lfirst(lc2);
+
+			/* Can't join to an inner path that is not parallel-safe */
+			if (!innerpath->parallel_safe)
+				continue;
+
+			/*
+			 * Like match_unsorted_outer, we only consider a single nestloop
+			 * path when the jointype is JOIN_UNIQUE_INNER.  But we have to scan
+			 * cheapest_parameterized_paths to find the one we want to consider,
+			 * because cheapest_total_path might not be parallel-safe.
+			 */
+			if (jointype == JOIN_UNIQUE_INNER)
+			{
+				if (!bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					continue;
+				innerpath = (Path *) create_unique_path(root, innerrel,
+											   innerpath, extra->sjinfo);
+			}
+
+			try_partial_nestloop_path(root, joinrel, outerpath, innerpath,
+									  pathkeys, jointype, extra);
+		}
+	}
 }
 
 /*
@@ -1240,6 +1442,55 @@ hash_inner_and_outer(PlannerInfo *root,
 				}
 			}
 		}
+
+		/*
+		 * If the joinrel is parallel-safe, we may be able to consider a
+		 * partial hash join.  However, we can't handle JOIN_UNIQUE_OUTER,
+		 * because the outer path will be partial, and therefore we won't be
+		 * able to properly guarantee uniqueness.  Also, the resulting path
+		 * must not be parameterized.
+		 */
+		if (joinrel->consider_parallel && jointype != JOIN_UNIQUE_OUTER &&
+			outerrel->partial_pathlist != NIL &&
+			bms_is_empty(joinrel->lateral_relids))
+		{
+			Path   *cheapest_partial_outer;
+			Path   *cheapest_safe_inner = NULL;
+
+			cheapest_partial_outer =
+				(Path *) linitial(outerrel->partial_pathlist);
+
+			/*
+			 * Normally, given that the joinrel is parallel-safe, the cheapest
+			 * total inner path will also be parallel-safe, but if not, we'll
+			 * have to search cheapest_parameterized_paths for the cheapest
+			 * unparameterized inner path.
+			 */
+			if (cheapest_total_inner->parallel_safe)
+				cheapest_safe_inner = cheapest_total_inner;
+			else
+			{
+				ListCell   *lc;
+
+				foreach(lc, innerrel->cheapest_parameterized_paths)
+				{
+					Path	   *innerpath = (Path *) lfirst(lc);
+
+					if (innerpath->parallel_safe &&
+						bms_is_empty(PATH_REQ_OUTER(innerpath)))
+					{
+						cheapest_safe_inner = innerpath;
+						break;
+					}
+				}
+			}
+
+			if (cheapest_safe_inner != NULL)
+				try_partial_hashjoin_path(root, joinrel,
+										  cheapest_partial_outer,
+										  cheapest_safe_inner,
+										  hashclauses, jointype, extra);
+		}
 	}
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index ad58058..e32f445 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1194,9 +1194,10 @@ mark_dummy_rel(RelOptInfo *rel)
 
 	/* Evict any previously chosen paths */
 	rel->pathlist = NIL;
+	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(rel, NIL, NULL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0));
 
 	/* Set or update cheapest_total_path and related fields */
 	set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 01209aa..0c82529 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1128,7 +1128,7 @@ create_gather_plan(PlannerInfo *root, GatherPath *best_path)
 
 	gather_plan = make_gather(subplan->targetlist,
 							  NIL,
-							  best_path->num_workers,
+							  best_path->path.parallel_degree,
 							  best_path->single_copy,
 							  subplan);
 
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 894f968..a8ff2ff 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -84,7 +84,8 @@ query_planner(PlannerInfo *root, List *tlist,
 
 		/* The only path for it is a trivial Result path */
 		add_path(final_rel, (Path *)
-				 create_result_path((List *) parse->jointree->quals));
+				 create_result_path(final_rel,
+									(List *) parse->jointree->quals));
 
 		/* Select cheapest path (pretty easy in this case...) */
 		set_cheapest(final_rel);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index ec0910d..d1c8e00 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -217,7 +217,12 @@ compare_path_costs_fuzzily(Path *path1, Path *path2, double fuzz_factor)
  * The cheapest_parameterized_paths list collects all parameterized paths
  * that have survived the add_path() tournament for this relation.  (Since
  * add_path ignores pathkeys for a parameterized path, these will be paths
- * that have best cost or best row count for their parameterization.)
+ * that have best cost or best row count for their parameterization.  We
+ * may also have both a parallel-safe and a non-parallel-safe path for the
+ * same parameterization in some cases, but this should be relatively
+ * rare since, most typically, all paths for the same relation will be
+ * parallel-safe or none of them will.)
+ *
  * cheapest_parameterized_paths always includes the cheapest-total
  * unparameterized path, too, if there is one; the users of that list find
  * it more convenient if that's included.
@@ -352,11 +357,12 @@ set_cheapest(RelOptInfo *parent_rel)
  *	  A path is worthy if it has a better sort order (better pathkeys) or
  *	  cheaper cost (on either dimension), or generates fewer rows, than any
  *	  existing path that has the same or superset parameterization rels.
+ *	  We also consider parallel-safe paths more worthy than others.
  *
  *	  We also remove from the rel's pathlist any old paths that are dominated
  *	  by new_path --- that is, new_path is cheaper, at least as well ordered,
- *	  generates no more rows, and requires no outer rels not required by the
- *	  old path.
+ *	  generates no more rows, requires no outer rels not required by the old
+ *	  path, and is no less parallel-safe.
  *
  *	  In most cases, a path with a superset parameterization will generate
  *	  fewer rows (since it has more join clauses to apply), so that those two
@@ -470,14 +476,16 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						else if (keyscmp == PATHKEYS_BETTER2)
 						{
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						else	/* keyscmp == PATHKEYS_EQUAL */
@@ -487,19 +495,25 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 								/*
 								 * Same pathkeys and outer rels, and fuzzily
 								 * the same cost, so keep just one; to decide
-								 * which, first check rows and then do a fuzzy
-								 * cost comparison with very small fuzz limit.
-								 * (We used to do an exact cost comparison,
-								 * but that results in annoying
-								 * platform-specific plan variations due to
-								 * roundoff in the cost estimates.)  If things
-								 * are still tied, arbitrarily keep only the
-								 * old path.  Notice that we will keep only
-								 * the old path even if the less-fuzzy
-								 * comparison decides the startup and total
-								 * costs compare differently.
+								 * which, first check parallel-safety, then
+								 * rows, then do a fuzzy cost comparison with
+								 * very small fuzz limit.  (We used to do an
+								 * exact cost comparison, but that results in
+								 * annoying platform-specific plan variations
+								 * due to roundoff in the cost estimates.)	If
+								 * things are still tied, arbitrarily keep
+								 * only the old path.  Notice that we will
+								 * keep only the old path even if the
+								 * less-fuzzy comparison decides the startup
+								 * and total costs compare differently.
 								 */
-								if (new_path->rows < old_path->rows)
+								if (new_path->parallel_safe >
+									old_path->parallel_safe)
+									remove_old = true;	/* new dominates old */
+								else if (new_path->parallel_safe <
+										 old_path->parallel_safe)
+									accept_new = false; /* old dominates new */
+								else if (new_path->rows < old_path->rows)
 									remove_old = true;	/* new dominates old */
 								else if (new_path->rows > old_path->rows)
 									accept_new = false; /* old dominates new */
@@ -512,10 +526,12 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 														 * dominates new */
 							}
 							else if (outercmp == BMS_SUBSET1 &&
-									 new_path->rows <= old_path->rows)
+									 new_path->rows <= old_path->rows &&
+									 new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 							else if (outercmp == BMS_SUBSET2 &&
-									 new_path->rows >= old_path->rows)
+									 new_path->rows >= old_path->rows &&
+									 new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 							/* else different parameterizations, keep both */
 						}
@@ -527,7 +543,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET1) &&
-								new_path->rows <= old_path->rows)
+								new_path->rows <= old_path->rows &&
+								new_path->parallel_safe >= old_path->parallel_safe)
 								remove_old = true;		/* new dominates old */
 						}
 						break;
@@ -538,7 +555,8 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 												   PATH_REQ_OUTER(old_path));
 							if ((outercmp == BMS_EQUAL ||
 								 outercmp == BMS_SUBSET2) &&
-								new_path->rows >= old_path->rows)
+								new_path->rows >= old_path->rows &&
+								new_path->parallel_safe <= old_path->parallel_safe)
 								accept_new = false;		/* old dominates new */
 						}
 						break;
@@ -685,6 +703,214 @@ add_path_precheck(RelOptInfo *parent_rel,
 	return true;
 }
 
+/*
+ * add_partial_path
+ *	  Like add_path, our goal here is to consider whether a path is worthy
+ *	  of being kept around, but the considerations here are a bit different.
+ *	  A partial path is one which can be executed in any number of workers in
+ *	  parallel such that each worker will generate a subset of the path's
+ *	  overall result.
+ *
+ *	  We don't generate parameterized partial paths for several reasons.  Most
+ *	  importantly, they're not safe to execute, because there's nothing to
+ *	  make sure that a parallel scan within the parameterized portion of the
+ *	  plan is running with the same value in every worker at the same time.
+ *	  Fortunately, it seems unlikely to be worthwhile anyway, because having
+ *	  each worker scan the entire outer relation and a subset of the inner
+ *	  relation will generally be a terrible plan.  The inner (parameterized)
+ *	  side of the plan will be small anyway.  There could be rare cases where
+ *	  this wins big - e.g. if join order constraints put a 1-row relation on
+ *	  the outer side of the topmost join with a parameterized plan on the inner
+ *	  side - but we'll have to be content not to handle such cases until somebody
+ *	  builds an executor infrastructure that can cope with them.
+ *
+ *	  Because we don't consider parameterized paths here, we also don't
+ *	  need to consider the row counts as a measure of quality: every path will
+ *	  produce the same number of rows.  Neither do we need to consider startup
+ *	  costs: parallelism is only used for plans that will be run to completion.
+ *	  Therefore, this routine is much simpler than add_path: it needs to
+ *	  consider only pathkeys and total cost.
+ */
+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	bool		accept_new = true;		/* unless we find a superior old path */
+	ListCell   *insert_after = NULL;	/* where to insert new item */
+	ListCell   *p1;
+	ListCell   *p1_prev;
+	ListCell   *p1_next;
+
+	/* Check for query cancel. */
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * As in add_path, throw out any paths which are dominated by the new
+	 * path, but throw out the new path if some existing path dominates it.
+	 */
+	p1_prev = NULL;
+	for (p1 = list_head(parent_rel->partial_pathlist); p1 != NULL;
+		 p1 = p1_next)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		bool		remove_old = false; /* unless new proves superior */
+		PathKeysComparison keyscmp;
+
+		p1_next = lnext(p1);
+
+		/* Compare pathkeys. */
+		keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys);
+
+		/* Unless pathkeys are incompatible, keep just one of the two paths. */
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
+			{
+				/* New path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER1)
+					accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost
+					 * STD_FUZZ_FACTOR)
+			{
+				/* Old path costs more; keep it only if pathkeys are better. */
+				if (keyscmp != PATHKEYS_BETTER2)
+					remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER1)
+			{
+				/* Costs are about the same, new path has better pathkeys. */
+				remove_old = true;
+			}
+			else if (keyscmp == PATHKEYS_BETTER2)
+			{
+				/* Costs are about the same, old path has better pathkeys. */
+				accept_new = false;
+			}
+			else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
+			{
+				/* Pathkeys are the same, and the old path costs more. */
+				remove_old = true;
+			}
+			else
+			{
+				/*
+				 * Pathkeys are the same, and new path isn't materially
+				 * cheaper.
+				 */
+				accept_new = false;
+			}
+		}
+
+		/*
+		 * Remove current element from partial_pathlist if dominated by new.
+		 */
+		if (remove_old)
+		{
+			parent_rel->partial_pathlist =
+				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
+			/* add_path has a special case for IndexPath; we don't need it */
+			Assert(!IsA(old_path, IndexPath));
+			pfree(old_path);
+			/* p1_prev does not advance */
+		}
+		else
+		{
+			/* new belongs after this old path if it has cost >= old's */
+			if (new_path->total_cost >= old_path->total_cost)
+				insert_after = p1;
+			/* p1_prev advances */
+			p1_prev = p1;
+		}
+
+		/*
+		 * If we found an old path that dominates new_path, we can quit
+		 * scanning the partial_pathlist; we will not add new_path, and we
+		 * assume new_path cannot dominate any later path.
+		 */
+		if (!accept_new)
+			break;
+	}
+
+	if (accept_new)
+	{
+		/* Accept the new path: insert it at proper place */
+		if (insert_after)
+			lappend_cell(parent_rel->partial_pathlist, insert_after, new_path);
+		else
+			parent_rel->partial_pathlist =
+				lcons(new_path, parent_rel->partial_pathlist);
+	}
+	else
+	{
+		/* add_path has a special case for IndexPath; we don't need it */
+		Assert(!IsA(new_path, IndexPath));
+		/* Reject and recycle the new path */
+		pfree(new_path);
+	}
+}
+
+/*
+ * add_partial_path_precheck
+ *	  Check whether a proposed new partial path could possibly get accepted.
+ *
+ * Unlike add_path_precheck, we can ignore startup cost and parameterization,
+ * since they don't matter for partial paths (see add_partial_path).  But
+ * we do want to make sure we don't add a partial path if there's already
+ * a complete path that dominates it, since in that case the proposed path
+ * is surely a loser.
+ */
+bool
+add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost,
+						  List *pathkeys)
+{
+	ListCell   *p1;
+
+	/*
+	 * Our goal here is twofold.  First, we want to find out whether this path
+	 * is clearly inferior to some existing partial path.  If so, we want to
+	 * reject it immediately.  Second, we want to find out whether this path
+	 * is clearly superior to some existing partial path -- at least, modulo
+	 * final cost computations.  If so, we definitely want to consider it.
+	 *
+	 * Unlike add_path(), we always compare pathkeys here.  This is because we
+	 * expect partial_pathlist to be very short, and getting a definitive
+	 * answer at this stage avoids the need to call add_path_precheck.
+	 */
+	foreach(p1, parent_rel->partial_pathlist)
+	{
+		Path	   *old_path = (Path *) lfirst(p1);
+		PathKeysComparison keyscmp;
+
+		keyscmp = compare_pathkeys(pathkeys, old_path->pathkeys);
+		if (keyscmp != PATHKEYS_DIFFERENT)
+		{
+			if (total_cost > old_path->total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER1)
+				return false;
+			if (old_path->total_cost > total_cost * STD_FUZZ_FACTOR &&
+				keyscmp != PATHKEYS_BETTER2)
+				return true;
+		}
+	}
+
+	/*
+	 * This path is neither clearly inferior to an existing partial path nor
+	 * clearly good enough that it might replace one.  Compare it to
+	 * non-parallel plans.  If it loses even before accounting for the cost of
+	 * the Gather node, we should definitely reject it.
+	 *
+	 * Note that we pass the total_cost to add_path_precheck twice.  This is
+	 * because it's never advantageous to consider the startup cost of a
+	 * partial path; the resulting plans, if run in parallel, will be run to
+	 * completion.
+	 */
+	if (!add_path_precheck(parent_rel, total_cost, total_cost, pathkeys,
+						   NULL))
+		return false;
+
+	return true;
+}
+
 
 /*****************************************************************************
  *		PATH NODE CREATION ROUTINES
@@ -697,7 +923,7 @@ add_path_precheck(RelOptInfo *parent_rel,
  */
 Path *
 create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers)
+					Relids required_outer, int parallel_degree)
 {
 	Path	   *pathnode = makeNode(Path);
 
@@ -705,10 +931,12 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->parent = rel;
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
-	pathnode->parallel_aware = nworkers > 0 ? true : false;
+	pathnode->parallel_aware = parallel_degree > 0 ? true : false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = parallel_degree;
 	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
 
-	cost_seqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
 
 	return pathnode;
 }
@@ -727,6 +955,8 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* samplescan has unordered result */
 
 	cost_samplescan(pathnode, root, rel, pathnode->param_info);
@@ -781,6 +1011,8 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
@@ -827,6 +1059,8 @@ create_bitmap_heap_path(PlannerInfo *root,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = bitmapqual->parallel_safe;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapqual = bitmapqual;
@@ -852,7 +1086,17 @@ create_bitmap_and_path(PlannerInfo *root,
 	pathnode->path.pathtype = T_BitmapAnd;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
+
+	/*
+	 * Currently, a BitmapHeapPath, BitmapAndPath, or BitmapOrPath will be
+	 * parallel-safe if and only if rel->consider_parallel is set.  So, we can
+	 * set the flag for this path based only on the relation-level flag,
+	 * without actually iterating over the list of children.
+	 */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
+
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -877,7 +1121,17 @@ create_bitmap_or_path(PlannerInfo *root,
 	pathnode->path.pathtype = T_BitmapOr;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = NULL;	/* not used in bitmap trees */
+
+	/*
+	 * Currently, a BitmapHeapPath, BitmapAndPath, or BitmapOrPath will be
+	 * parallel-safe if and only if rel->consider_parallel is set.  So, we can
+	 * set the flag for this path based only on the relation-level flag,
+	 * without actually iterating over the list of children.
+	 */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
+
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->bitmapquals = bitmapquals;
@@ -903,6 +1157,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;		/* always unordered */
 
 	pathnode->tidquals = tidquals;
@@ -921,7 +1177,8 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals,
  * Note that we must handle subpaths = NIL, representing a dummy access path.
  */
 AppendPath *
-create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
+create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer,
+				   int parallel_degree)
 {
 	AppendPath *pathnode = makeNode(AppendPath);
 	ListCell   *l;
@@ -931,6 +1188,8 @@ create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* result is always considered
 										 * unsorted */
 	pathnode->subpaths = subpaths;
@@ -955,6 +1214,8 @@ create_append_path(RelOptInfo *rel, List *subpaths, Relids required_outer)
 		if (l == list_head(subpaths))	/* first node? */
 			pathnode->path.startup_cost = subpath->startup_cost;
 		pathnode->path.total_cost += subpath->total_cost;
+		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
+			subpath->parallel_safe;
 
 		/* All child paths must have same parameterization */
 		Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
@@ -985,6 +1246,8 @@ create_merge_append_path(PlannerInfo *root,
 	pathnode->path.param_info = get_appendrel_parampathinfo(rel,
 															required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->subpaths = subpaths;
 
@@ -1008,6 +1271,8 @@ create_merge_append_path(PlannerInfo *root,
 		Path	   *subpath = (Path *) lfirst(l);
 
 		pathnode->path.rows += subpath->rows;
+		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
+			subpath->parallel_safe;
 
 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
 		{
@@ -1052,7 +1317,7 @@ create_merge_append_path(PlannerInfo *root,
  *	  This is only used for the case of a query with an empty jointree.
  */
 ResultPath *
-create_result_path(List *quals)
+create_result_path(RelOptInfo *rel, List *quals)
 {
 	ResultPath *pathnode = makeNode(ResultPath);
 
@@ -1060,6 +1325,8 @@ create_result_path(List *quals)
 	pathnode->path.parent = NULL;
 	pathnode->path.param_info = NULL;	/* there are no other rels... */
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = NIL;
 	pathnode->quals = quals;
 
@@ -1094,6 +1361,8 @@ create_material_path(RelOptInfo *rel, Path *subpath)
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = subpath->parallel_safe;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.pathkeys = subpath->pathkeys;
 
 	pathnode->subpath = subpath;
@@ -1155,6 +1424,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = subpath->param_info;
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = subpath->parallel_safe;
+	pathnode->path.parallel_degree = 0;
 
 	/*
 	 * Assume the output is unsorted, since we don't necessarily have pathkeys
@@ -1328,19 +1599,30 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
  */
 GatherPath *
 create_gather_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
-				   Relids required_outer, int nworkers)
+				   Relids required_outer)
 {
 	GatherPath *pathnode = makeNode(GatherPath);
 
+	Assert(subpath->parallel_safe);
+
 	pathnode->path.pathtype = T_Gather;
 	pathnode->path.parent = rel;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = false;
+	pathnode->path.parallel_degree = subpath->parallel_degree;
 	pathnode->path.pathkeys = NIL;		/* Gather has unordered result */
 
 	pathnode->subpath = subpath;
-	pathnode->num_workers = nworkers;
+	pathnode->single_copy = false;
+
+	if (pathnode->path.parallel_degree == 0)
+	{
+		pathnode->path.parallel_degree = 1;
+		pathnode->path.pathkeys = subpath->pathkeys;
+		pathnode->single_copy = true;
+	}
 
 	cost_gather(pathnode, root, rel, pathnode->path.param_info);
 
@@ -1393,6 +1675,8 @@ create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_subqueryscan(pathnode, root, rel, pathnode->param_info);
@@ -1416,6 +1700,8 @@ create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = pathkeys;
 
 	cost_functionscan(pathnode, root, rel, pathnode->param_info);
@@ -1439,6 +1725,8 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	cost_valuesscan(pathnode, root, rel, pathnode->param_info);
@@ -1461,6 +1749,8 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* XXX for now, result is always unordered */
 
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
@@ -1484,6 +1774,8 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->param_info = get_baserel_parampathinfo(root, rel,
 													 required_outer);
 	pathnode->parallel_aware = false;
+	pathnode->parallel_safe = rel->consider_parallel;
+	pathnode->parallel_degree = 0;
 	pathnode->pathkeys = NIL;	/* result is always unordered */
 
 	/* Cost is the same as for a regular CTE scan */
@@ -1517,6 +1809,8 @@ create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = rel->consider_parallel;
+	pathnode->path.parallel_degree = 0;
 	pathnode->path.rows = rows;
 	pathnode->path.startup_cost = startup_cost;
 	pathnode->path.total_cost = total_cost;
@@ -1653,6 +1947,10 @@ create_nestloop_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_safe = joinrel->consider_parallel &&
+		outer_path->parallel_safe && inner_path->parallel_safe;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->path.parallel_degree = outer_path->parallel_degree;
 	pathnode->path.pathkeys = pathkeys;
 	pathnode->jointype = jointype;
 	pathnode->outerjoinpath = outer_path;
@@ -1711,6 +2009,9 @@ create_mergejoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel &&
+		outer_path->parallel_safe && inner_path->parallel_safe;
+	pathnode->jpath.path.parallel_degree = 0;
 	pathnode->jpath.path.pathkeys = pathkeys;
 	pathnode->jpath.jointype = jointype;
 	pathnode->jpath.outerjoinpath = outer_path;
@@ -1768,6 +2069,10 @@ create_hashjoin_path(PlannerInfo *root,
 								  required_outer,
 								  &restrict_clauses);
 	pathnode->jpath.path.parallel_aware = false;
+	pathnode->jpath.path.parallel_safe = joinrel->consider_parallel &&
+		outer_path->parallel_safe && inner_path->parallel_safe;
+	/* This is a foolish way to estimate parallel_degree, but for now... */
+	pathnode->jpath.path.parallel_degree = outer_path->parallel_degree;
 
 	/*
 	 * A hashjoin never has pathkeys, since its output ordering is
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index f2bdfcc..29ce75f 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -107,6 +107,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	rel->reltargetlist = NIL;
 	rel->pathlist = NIL;
 	rel->ppilist = NIL;
+	rel->partial_pathlist = NIL;
 	rel->cheapest_startup_path = NULL;
 	rel->cheapest_total_path = NULL;
 	rel->cheapest_unique_path = NULL;
@@ -370,6 +371,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->reltargetlist = NIL;
 	joinrel->pathlist = NIL;
 	joinrel->ppilist = NIL;
+	joinrel->partial_pathlist = NIL;
 	joinrel->cheapest_startup_path = NULL;
 	joinrel->cheapest_total_path = NULL;
 	joinrel->cheapest_unique_path = NULL;
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 5393005..f9f13b4 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -458,6 +458,7 @@ typedef struct RelOptInfo
 	List	   *reltargetlist;	/* Vars to be output by scan of relation */
 	List	   *pathlist;		/* Path structures */
 	List	   *ppilist;		/* ParamPathInfos used in pathlist */
+	List	   *partial_pathlist;	/* partial Paths */
 	struct Path *cheapest_startup_path;
 	struct Path *cheapest_total_path;
 	struct Path *cheapest_unique_path;
@@ -759,6 +760,8 @@ typedef struct Path
 	RelOptInfo *parent;			/* the relation this path can build */
 	ParamPathInfo *param_info;	/* parameterization info, or NULL if none */
 	bool		parallel_aware; /* engage parallel-aware logic? */
+	bool		parallel_safe;	/* OK to use as part of parallel plan? */
+	int			parallel_degree; /* desired parallel degree; 0 = not parallel */
 
 	/* estimated size/costs for path (see costsize.c for more info) */
 	double		rows;			/* estimated number of result tuples */
@@ -1062,7 +1065,6 @@ typedef struct GatherPath
 {
 	Path		path;
 	Path	   *subpath;		/* path for each worker */
-	int			num_workers;	/* number of workers sought to help */
 	bool		single_copy;	/* path must not be executed >1x */
 } GatherPath;
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac21a3a..25a7303 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -72,7 +72,7 @@ extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
-			 ParamPathInfo *param_info, int nworkers);
+			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 8fb9eda..4b24da0 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -29,9 +29,12 @@ extern void add_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_path_precheck(RelOptInfo *parent_rel,
 				  Cost startup_cost, Cost total_cost,
 				  List *pathkeys, Relids required_outer);
+extern void add_partial_path(RelOptInfo *parent_rel, Path *new_path);
+extern bool add_partial_path_precheck(RelOptInfo *parent_rel,
+						  Cost total_cost, List *pathkeys);
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer, int nworkers);
+					Relids required_outer, int parallel_degree);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
@@ -59,19 +62,18 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths,
-				   Relids required_outer);
+				   Relids required_outer, int parallel_degree);
 extern MergeAppendPath *create_merge_append_path(PlannerInfo *root,
 						 RelOptInfo *rel,
 						 List *subpaths,
 						 List *pathkeys,
 						 Relids required_outer);
-extern ResultPath *create_result_path(List *quals);
+extern ResultPath *create_result_path(RelOptInfo *rel, List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 				   Path *subpath, SpecialJoinInfo *sjinfo);
 extern GatherPath *create_gather_path(PlannerInfo *root,
-				   RelOptInfo *rel, Path *subpath, Relids required_outer,
-				   int nworkers);
+				   RelOptInfo *rel, Path *subpath, Relids required_outer);
 extern Path *create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 						 List *pathkeys, Relids required_outer);
 extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 4e00e9f..940178f 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -50,6 +50,8 @@ extern RelOptInfo *make_one_rel(PlannerInfo *root, List *joinlist);
 extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
+extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
#28Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#27)
Re: parallel joins, and better parallel explain

On Thu, Dec 24, 2015 at 4:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 23, 2015 at 2:34 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yeah right. After applying all three patches this problem is fixed; now
parallel hash join is faster than normal hash join.

I have tested one more case, which Amit mentioned, and I can see that in
that case the parallel plan (parallel degree >= 3) is still slow: in the
normal case it selects "Hash Join", but with parallel workers > 3 it
selects a parallel "Nest Loop Join", which makes it costlier.

While investigating this problem, I discovered that I can produce a
regression even on unpatched master:

yeah, right..

But this is not entirely the fault of the parallel query code. If you
force a seqscan-over-seqscan plan in the non-parallel case, it
estimates the join cost as 287772.00, only slightly more than the
261522.02 cost units it thinks a non-parallel hash join will cost. In
fact, however, the non-parallel hash join runs in 1.2 seconds and the
non-parallel nested loop takes 46 seconds.

right..

Updated patch attached.

Thanks for the updated patch...

I have found a regression in one more scenario, in hash join.

Scenario:
--------------
I created an inner table whose data doesn't fit in RAM (on a VM with 1GB
of RAM). The purpose is to make the disk scan dominant: since the
parallel join repeats the disk scan and hash table build of the inner
table in every worker, there will be a lot of parallel I/O, and the plan
has to pay that penalty.

I think that even though the inner table scan and hash table build run in
parallel, the parallel I/O will become the bottleneck.

Do we need to consider the cost of parallel I/O as well? I am not sure we
can really do that.

Note: I have tested with a 1GB RAM machine and an 8GB RAM machine. The
regression on the 8GB machine is smaller than on the 1GB machine.

create table t1 (c1 int, c2 int, c3 text);
create table t2 (c1 int, c2 int, c3 text);

insert into t1 values(generate_series(1,100000000),
                      generate_series(1,100000000), repeat('x', 100));
insert into t2 values(generate_series(1,48000000),
                      generate_series(1,48000000), repeat('x', 5));

analyze t1;
analyze t2;
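
As a quick sanity check that the inner table alone exceeds the 1GB of
RAM, one can look at the on-disk sizes (pg_table_size() is the stock
function; the figures below are rough expectations from the row counts,
not measured values):

select pg_size_pretty(pg_table_size('t1')) as t1_size,
       pg_size_pretty(pg_table_size('t2')) as t2_size;

At roughly 140 bytes per t1 row and 40 bytes per t2 row, t1 should come
out around 14GB and t2 around 2GB, so even the smaller table cannot be
cached on the 1GB machine.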

Test with 1GB RAM:
-----------------------------

postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=12248485.55..12248485.56 rows=1 width=0) (actual time=147490.455..147490.455 rows=1 loops=1)
  ->  Hash Join (cost=1526963.25..12208485.47 rows=16000033 width=0) (actual time=26652.871..143368.989 rows=47999950 loops=1)
        Hash Cond: (t1.c1 = t2.c1)
        Join Filter: ((t2.c2 + t1.c1) > 100)
        Rows Removed by Join Filter: 50
        ->  Seq Scan on t1 (cost=0.00..2742412.72 rows=100005072 width=4) (actual time=130.580..40127.004 rows=100000000 loops=1)
        ->  Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=26500.439..26500.439 rows=48000000 loops=1)
              Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
              ->  Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.039..11402.343 rows=48000000 loops=1)
Planning time: 0.410 ms
Execution time: 147490.553 ms
(11 rows)

postgres=# set max_parallel_degree=6;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4969933.98..4969933.99 rows=1 width=0) (actual time=386024.487..386024.488 rows=1 loops=1)
  ->  Gather (cost=1527963.25..4929933.89 rows=16000033 width=0) (actual time=199190.138..379487.861 rows=47999950 loops=1)
        Number of Workers: 6
        ->  Hash Join (cost=1526963.25..3328930.59 rows=16000033 width=0) (actual time=178885.161..320724.381 rows=6857136 loops=7)
              Hash Cond: (t1.c1 = t2.c1)
              Join Filter: ((t2.c2 + t1.c1) > 100)
              Rows Removed by Join Filter: 7
              ->  Parallel Seq Scan on t1 (cost=0.00..421909.65 rows=15385396 width=4) (actual time=106.403..11735.643 rows=14285714 loops=7)
              ->  Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=177959.433..177959.433 rows=48000000 loops=7)
                    Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
                    ->  Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.022..20778.693 rows=48000000 loops=7)
Planning time: 0.372 ms
Execution time: 386025.056 ms

Test with 8GB RAM:
---------------------------

postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=12229853.83..12229853.84 rows=1 width=0) (actual time=111113.286..111113.286 rows=1 loops=1)
  ->  Hash Join (cost=1526963.25..12189853.75 rows=16000033 width=0) (actual time=15830.319..108557.658 rows=47999950 loops=1)
        Hash Cond: (t1.c1 = t2.c1)
        Join Filter: ((t2.c2 + t1.c1) > 100)
        Rows Removed by Join Filter: 50
        ->  Seq Scan on t1 (cost=0.00..2724138.00 rows=100000000 width=4) (actual time=3.515..43207.798 rows=100000000 loops=1)
        ->  Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=15436.088..15436.088 rows=48000000 loops=1)
              Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
              ->  Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.677..6290.310 rows=48000000 loops=1)
Planning time: 0.287 ms
Execution time: 111113.358 ms
(11 rows)

postgres=# set max_parallel_degree=6;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6538149.22..6538149.23 rows=1 width=0) (actual time=172636.184..172636.184 rows=1 loops=1)
  ->  Gather (cost=1527963.25..6498149.14 rows=16000033 width=0) (actual time=40952.576..168973.552 rows=47999950 loops=1)
        Number of Workers: 6
        ->  Hash Join (cost=1526963.25..4897145.84 rows=16000033 width=0) (actual time=41109.818..151129.893 rows=6857136 loops=7)
              Hash Cond: (t1.c1 = t2.c1)
              Join Filter: ((t2.c2 + t1.c1) > 100)
              Rows Removed by Join Filter: 7
              ->  Parallel Seq Scan on t1 (cost=0.00..1890804.67 rows=16666667 width=4) (actual time=0.492..86241.998 rows=14285714 loops=7)
              ->  Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=40936.920..40936.920 rows=48000000 loops=7)
                    Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
                    ->  Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.024..22644.484 rows=48000000 loops=7)
Planning time: 2.668 ms
Execution time: 172636.647 ms
(13 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#29Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#28)
Re: parallel joins, and better parallel explain

On Mon, Jan 4, 2016 at 4:50 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I created an inner table whose data doesn't fit in RAM (on a VM with 1GB
of RAM). The purpose is to make the disk scan dominant, and since the
parallel join repeats the disk scan and hash table build of the inner
table, there will be a lot of parallel I/O and the plan has to pay that
penalty.

I think that even though the inner table scan and hash table build run in
parallel, the parallel I/O will become the bottleneck.

Hmm. Because only 1/1024th of the hash table fits in work_mem, the
executor is going to have to write out all of the tuples that don't
belong to the first batch to a temporary file and then read them back
in. So each backend is going to write essentially the entirety of t2
out to disk and then read it all back in again. The non-parallel case
will also write most of the table contents and then read them back in,
but at least it will only be doing that once rather than 7 times, so
it's not as bad. Also, with fewer backends running, the non-parallel
case will have a bit more memory free for caching purposes.
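
To put rough numbers on it (assuming something on the order of 40 bytes
per t2 tuple as stored in the batch files; the exact width doesn't change
the conclusion):

select pg_size_pretty(48000000::bigint * 40 * 2) as temp_io_per_backend,
       pg_size_pretty(48000000::bigint * 40 * 2 * 7) as temp_io_total;

That's roughly 3.5GB of temporary-file traffic (written once, read once)
per backend, or around 25GB across seven backends, on a machine with 1GB
of RAM.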

Do we need to consider the cost of parallel I/O as well? I am not sure we
can really do that.

It seems to me that the problem here is that you've set
max_parallel_degree to an unrealistically high value. The query
planner is entitled to assume that the user has set
max_parallel_degree to a value which is small enough that the workers
won't be fighting too viciously with each other over resources. It
doesn't really matter whether those resources are CPU resources or I/O
resources. I'm wondering if your 1GB VM really even has as many as 7
vCPUs, because that would seem to be something of an unusual
configuration - and if it doesn't, then setting max_parallel_degree to
a value that high is certainly user error. Even if it does, it's still
not right to set the value as high as six unless the system also has
enough I/O bandwidth to accommodate the amount of I/O that you expect
your queries to generate, and here it seems like it probably doesn't.

To put that another way, you can always make parallel query perform
badly by telling it to use too many workers relative to the size of
the machine you have. This is no different than getting bad query
plans by configuring work_mem or effective_cache_size or any other
query planner GUC to a value that doesn't reflect the actual execution
environment. I would only consider this to be a problem with the
parallel join patch if the chosen plan is slower even on a machine
that's big enough to justify setting max_parallel_degree=6 in the
first place.

While studying this example, I thought about whether to try to fix
this case by generating a partial hash join path only if we expect a
single batch, which would then cause the query planner to plan this
query some other way. But after some thought I don't think that's the
right approach. Multi-batch hash joins are in general quite a lot
slower than single-batch hash joins - and initial_cost_hashjoin knows
that - but if the system has adequate I/O bandwidth, that problem
shouldn't be any worse for a parallel hash join than it is for a
non-parallel hash join. I think the reason you're losing here is
because the system either doesn't have as many vCPUs as the number of
worker processes you are giving it, or it has a very limited amount of
I/O bandwidth that can't handle multiple processes doing sequential
I/O at the same time - e.g. a single spinning disk, or a single SSD
plus a bunch of virtualization overhead. But that need not be the
case. On a system where temporary files are written to a filesystem
backed by an array of disks, you might well get some I/O parallelism.
Of course if we experiment and find that doesn't work out well for
some reason, then we've got a problem, but it doesn't seem implausible
that it might be just fine.
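
Incidentally, the EXPLAIN output above shows Batches: 1024 at about
2856kB apiece, which suggests the whole hash table is on the order of
2.8GB. On a machine with enough memory, something along these lines
(untried here) should collapse it to a single batch and take the repeated
batch I/O out of the picture entirely:

set work_mem = '3GB';
explain SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1
    AND t2.c2 + t1.c1 > 100;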

Another interesting question about this particular query is whether a
merge join would have been faster, especially given all Peter
Geoghegan's work to improve sort performance.
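
(One way to test that on this data set, sketched here with Dilip's t1/t2
query, is to disable the other join methods for the session and re-run:)

SET enable_hashjoin = off;
SET enable_nestloop = off;
EXPLAIN ANALYZE SELECT count(*) FROM t1 JOIN t2
    ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
-- the planner should now pick Sort + Merge Join, making the timing
-- comparison direct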

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#29)
Re: parallel joins, and better parallel explain

On Tue, Jan 5, 2016 at 1:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It seems to me that the problem here is that you've set
max_parallel_degree to an unrealistically high value. The query
planner is entitled to assume that the user has set
max_parallel_degree to a value which is small enough that the workers
won't be fighting too viciously with each other over resources. It
doesn't really matter whether those resources are CPU resources or I/O
resources. I'm wondering if your 1GB VM really even has as many as 7
vCPUs, because that would seem to be something of an unusual
configuration - and if it doesn't, then setting max_parallel_degree to
a value that high is certainly user error. Even if it does, it's still
not right to set the value as high as six unless the system also has
enough I/O bandwidth to accommodate the amount of I/O that you expect
your queries to generate, and here it seems like it probably doesn't.

To put that another way, you can always make parallel query perform
badly by telling it to use too many workers relative to the size of
the machine you have. This is no different than getting bad query
plans by configuring work_mem or effective_cache_size or any other
query planner GUC to a value that doesn't reflect the actual execution
environment. I would only consider this to be a problem with the
parallel join patch if the chosen plan is slower even on a machine
that's big enough to justify setting max_parallel_degree=6 in the
first place.

Yes, it's a valid point... I had configured 6 processors for the virtual
machine, but that was with HT. So this time I configured 8 processors and
measured performance again with a lower parallel degree.

Even with a lower parallel degree there is some regression, but as you
mentioned there can be other limitations as well; for instance, I am
configuring a 50GB disk and filling 20GB of it with data.

I think you are right: before coming to any conclusion, we need to test
on a really high-end machine where the machine itself doesn't have any
resource constraints.

VM configuration:
RAM: 1GB
Processors: 8 (Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz) - the host is
an i7, so I doubt it is really giving me 8 full cores; hence I tested
with a lower parallel degree.
SSD: 50GB

postgres=# set max_parallel_degree=3;
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7920946.47..7920946.48 rows=1 width=0) (actual time=162329.829..162329.829 rows=1 loops=1)
   ->  Gather  (cost=1527963.25..7880946.39 rows=16000033 width=0) (actual time=58233.106..159140.629 rows=47999950 loops=1)
         Number of Workers: 3
         ->  Hash Join  (cost=1526963.25..6279943.09 rows=16000033 width=0) (actual time=58346.087..144309.987 rows=11999988 loops=4)
               Hash Cond: (t1.c1 = t2.c1)
               Join Filter: ((t2.c2 + t1.c1) > 100)
               Rows Removed by Join Filter: 12
               ->  Parallel Seq Scan on t1  (cost=0.00..2064959.01 rows=32259701 width=4) (actual time=98.514..27003.566 rows=25000000 loops=4)
               ->  Hash  (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=58012.228..58012.228 rows=48000000 loops=4)
                     Buckets: 131072  Batches: 1024  Memory Usage: 2856kB
                     ->  Seq Scan on t2  (cost=0.00..739461.00 rows=48000100 width=8) (actual time=3.524..9634.181 rows=48000000 loops=4)
 Planning time: 1.945 ms
 Execution time: 162330.657 ms

postgres=# set max_parallel_degree=2;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8744354.81..8744354.82 rows=1 width=0) (actual time=133715.245..133715.245 rows=1 loops=1)
   ->  Gather  (cost=1527963.25..8704354.73 rows=16000033 width=0) (actual time=49240.892..130699.685 rows=47999950 loops=1)
         Number of Workers: 2
         ->  Hash Join  (cost=1526963.25..7103351.43 rows=16000033 width=0) (actual time=48916.074..116934.088 rows=15999983 loops=3)
               Hash Cond: (t1.c1 = t2.c1)
               Join Filter: ((t2.c2 + t1.c1) > 100)
               Rows Removed by Join Filter: 17
               ->  Parallel Seq Scan on t1  (cost=0.00..2159049.80 rows=41668780 width=4) (actual time=106.882..22650.646 rows=33333333 loops=3)
               ->  Hash  (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=48670.370..48670.370 rows=48000000 loops=3)
                     Buckets: 131072  Batches: 1024  Memory Usage: 2856kB
                     ->  Seq Scan on t2  (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.618..7908.589 rows=48000000 loops=3)
 Planning time: 0.380 ms
 Execution time: 133715.932 ms
(13 rows)

postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=12248485.55..12248485.56 rows=1 width=0) (actual time=92297.234..92297.234 rows=1 loops=1)
   ->  Hash Join  (cost=1526963.25..12208485.47 rows=16000033 width=0) (actual time=15739.911..89627.652 rows=47999950 loops=1)
         Hash Cond: (t1.c1 = t2.c1)
         Join Filter: ((t2.c2 + t1.c1) > 100)
         Rows Removed by Join Filter: 50
         ->  Seq Scan on t1  (cost=0.00..2742412.72 rows=100005072 width=4) (actual time=127.260..24826.175 rows=100000000 loops=1)
         ->  Hash  (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=15560.002..15560.002 rows=48000000 loops=1)
               Buckets: 131072  Batches: 1024  Memory Usage: 2856kB
               ->  Seq Scan on t2  (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.834..6199.727 rows=48000000 loops=1)
 Planning time: 0.244 ms
 Execution time: 92298.000 ms
(11 rows)

One strange behaviour: after increasing the number of processors for the
VM, the max_parallel_degree=0 case is also performing better.
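
For quick reference, the three runs above: max_parallel_degree=3 took
162330 ms, max_parallel_degree=2 took 133715 ms, and max_parallel_degree=0
took 92298 ms; so on this I/O-constrained VM the serial plan was roughly
1.4x to 1.8x faster than the parallel ones.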

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#31Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#30)
Re: parallel joins, and better parallel explain

On Mon, Jan 4, 2016 at 8:52 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

One strange behaviour: after increasing the number of processors for the
VM, the max_parallel_degree=0 case is also performing better.

So, you went from 6 vCPUs to 8? In general, adding more CPUs means
that there is less contention for CPU time, but if you already had 6
CPUs and nothing else running, I don't know why the backend running
the query would have had a problem getting a whole CPU to itself. If
you previously only had 1 or 2 CPUs then there might have been some
CPU competition with background processes, but if you had 6 then I
don't know why the max_parallel_degree=0 case got faster with 8.
Anyway, I humbly suggest that this query isn't the right place to put
our attention. There's no reason why we can't improve things further
in the future, and if it turns out that lots of people have problems
with the cost estimates on multi-batch parallel hash joins, then we
can revise the cost model. We wouldn't treat a single query where a
non-parallel multi-batch hash join runs slower than the costing would
suggest as a reason to revise the cost model for that case, and I
don't think this patch should be held to a higher standard. In this
particular case, you can easily make the problem go away by tuning
configuration parameters, which seems like an acceptable answer for
people who run into this, unless it becomes clear that this particular
problem is widespread and can't be solved without configuration
changes that introduce other issues at the same time.
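
(The sort of per-session tuning meant here, with illustrative values;
GUC names as in the 9.6 development tree:)

-- give the query fewer workers than the global default allows:
SET max_parallel_degree = 1;
-- or make parallelism look more expensive, so the planner picks it
-- only when the expected benefit is large:
SET parallel_setup_cost = 10000;   -- default 1000
SET parallel_tuple_cost = 0.5;     -- default 0.1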

Keep in mind that, right now, the patch is doing just about
the simplest thing possible, and that's working pretty well. Anything
we change at this point is going to be in the direction of adding more
complexity than what I've got right now and more than we've got in the
corresponding non-parallel case. That's OK, but I think it's
appropriate that we only do that if we're pretty sure that those
changes are going to be an improvement. And I think, by and large,
that we don't have enough perspective on this to know that at this
point. Until this gets some wider testing, which probably isn't going
to happen very much until this gets committed, it's hard to say which
problems are just things we're artificially creating in the lab and
which ones are going to be annoyances in the real world. Barring
strenuous objections or discovery of more serious problems with this
than have turned up so far, I'm inclined to go ahead and commit it
fairly soon, so that it attracts some more eyeballs while there's
still a little time left in the development cycle to do something
about whatever the systematic problems turn out to be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#31)
Re: parallel joins, and better parallel explain

On Wed, Jan 6, 2016 at 10:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 4, 2016 at 8:52 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

One strange behaviour: after increasing the number of processors for the
VM, the max_parallel_degree=0 case is also performing better.

So, you went from 6 vCPUs to 8? In general, adding more CPUs means
that there is less contention for CPU time, but if you already had 6
CPUs and nothing else running, I don't know why the backend running
the query would have had a problem getting a whole CPU to itself. If
you previously only had 1 or 2 CPUs then there might have been some
CPU competition with background processes, but if you had 6 then I
don't know why the max_parallel_degree=0 case got faster with 8.

I am really not sure about this case; maybe the CPU allocation in the
virtual machine had a problem, but I can't say for certain.

Anyway, I humbly suggest that this query isn't the right place to put
our attention. There's no reason why we can't improve things further
in the future, and if it turns out that lots of people have problems
with the cost estimates on multi-batch parallel hash joins, then we
can revise the cost model. We wouldn't treat a single query where a
non-parallel multi-batch hash join runs slower than the costing would
suggest as a reason to revise the cost model for that case, and I
don't think this patch should be held to a higher standard. In this
particular case, you can easily make the problem go away by tuning
configuration parameters, which seems like an acceptable answer for
people who run into this,

Yes, I agree with this point; the cost model can always be improved. And
anyway, in most queries, even in the TPC-H benchmark, we have seen big
improvements with parallel join.

I have done further testing to observe the planning time, using TPC-H
queries and some other many-table join queries (7-8 tables).

I did not find any visible regression in planning time.
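
(A sketch of one way to run such checks; in this era of the tree, plain
EXPLAIN already reports planning time, so the queries need not be
executed. Table names beyond t1/t2 are placeholders:)

\timing on
EXPLAIN SELECT count(*)
  FROM t1
  JOIN t2 ON t1.c1 = t2.c1
  JOIN t3 ON t2.c1 = t3.c1;   -- t3 stands in for the wider join tests
-- compare the reported "Planning time" with and without the patch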

* I have tested many combinations of queries; because of the large size
of the queries and results, I did not attach them to the mail. Let me
know if anybody wants the details of the queries.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com